Data Sets: Data, Data Everywhere and not enough to think!

Robb Sinclair
April 23, 2021

Working in a large team of over 60 data architects, data scientists, data analysts, and application developers, we see, touch and feel data all the time. We must augment the data that we have, and we need a library of good sources of new data. 

Data is everywhere. It is like water. But, just like water, if it is not moving it goes stale. If it is an isolated set, it is limited in its purpose – just as a glass of water quenches thirst but might not put out a fire. But, if data is the new water and everyone needs it, where is it? While it’s extremely important to know where data is, it’s as important – if not more important – to know where other data can be found. If you’ve been working in data analytics or data science, you likely have a favorite go-to source for data, whether it’s Public Datasets, Open Data, FreeData, or something else.  

If you are unsure what you would do with new data, or you are already drowning in your own, here are some examples of how we use open data to go beyond what our clients’ data can do on its own:

  1. Train models – We use open data to train models when we need volume, when we need to extend the available data, or when we just don’t have the real data available while building AI or analytic models. This form of rapid prototyping can get a key concept across to the client more easily than a thousand PowerPoints. Visualized data that simulates real data sets, modelled in a tool of choice and presented well, is a powerful influencer for ideas. Plus, we get the feedback we need to build better models, better solutions, and better outcomes.

  2. Blending our data with open data – On its own, a single source of data is limited. Adding other data sets that can be matched to the existing data adds new understanding and provides new capabilities.

  3. Validating data – Many (dare we say all) systems contain poor-quality data. Sometimes, we can align our data with data sets that will improve its quality. The easy use case here is address data. This will not necessarily match the person to the address, but it will verify whether the address is valid (e.g. the city is in the right state).

  4. Mastering data – Take the above scenario, where we let bad data in and have to fix it later. Mastering data turns this around and uses reference data to validate new input as we capture it (e.g. CA could be used as a state – California – in some instances, but also as a country code – Canada – in others).

  5. Building out a comprehensive view of a specific area of data – This analyzes our product relative to others in the market using comparative and substitution engine data. It provides a 360-degree view of customers and their buying journeys.

  6. Finding new insights – Blending multiple data sets will allow for some unique quantitative analysis. Without all the data, the pattern does not emerge.  
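To make a few of these use cases concrete, here is a minimal Python sketch of validating, mastering, and blending. Everything in it is an assumption made for illustration: the tiny lookup tables stand in for a real open data set, and the function names (is_valid_city_state, resolve_code, blend) are hypothetical rather than part of any library or of our own tooling.

```python
from typing import Optional

# Hypothetical reference data standing in for an open address data set.
VALID_CITY_STATE = {
    ("Sacramento", "CA"),
    ("Austin", "TX"),
    ("Albany", "NY"),
}

# The same two-letter code can mean different things in different fields.
US_STATE_CODES = {"CA": "California", "TX": "Texas", "NY": "New York"}
COUNTRY_CODES = {"CA": "Canada", "US": "United States"}


def is_valid_city_state(city: str, state: str) -> bool:
    """Validating: check a city/state pair against the reference set."""
    return (city.strip().title(), state.strip().upper()) in VALID_CITY_STATE


def resolve_code(code: str, field: str) -> Optional[str]:
    """Mastering: interpret a code at capture time, based on its field."""
    lookup = {"state": US_STATE_CODES, "country": COUNTRY_CODES}.get(field, {})
    return lookup.get(code.strip().upper())


def blend(records, reference_by_state):
    """Blending: attach open-data attributes to each record by state code."""
    return [
        {**record, **reference_by_state.get(record["state"], {})}
        for record in records
    ]
```

For example, is_valid_city_state("Austin", "CA") returns False because that pair is not in the reference set, while resolve_code("CA", "state") and resolve_code("CA", "country") return "California" and "Canada" respectively. A real project would load the reference tables from one of the sources listed below instead of hard-coding them.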

Below is a list of some of the datasets available. As you can see, there is an ocean of data that is free, gated, requires a fee/donation, or requires a paid subscription. I do not endorse, and am not paid or sponsored by, any of the available open data sets. I also do not vouch for the quality of the sites, the organizations, or the quality and accuracy of the data sets listed.

Hopefully, knowing where to get some of this data and some of the use cases will help you. You still need to know how to use the data and identify the business outcomes that you are trying to understand and change, but hopefully this blog gives you some ideas, expands the oceans to explore, and has you setting off on new data adventures. 

If your favorite open dataset is not listed – send back comments. If you find this helpful and do something cool with this new and expanded power – let me know. If you are new to this world and want to sit down and discuss how your data can come alive by extending it – please connect. 

As you embark into this sea of available open data sets – Bon Voyage!  

Google Datasets:
https://datasetsearch.research.google.com/

Google BigQuery:
https://cloud.google.com/bigquery/public-data/

Kaggle:
https://www.kaggle.com/datasets

US Government Data: 
https://www.data.gov/

Canadian Government Data:
https://open.canada.ca/en/open-data

Canadian GeoSpatial Data: 
https://canadiangis.com/data.php

Statistics Canada Open Data:
https://www.statcan.gc.ca/eng/microdata/data-centres/data

Ontario Data:
https://data.ontario.ca/

Canadian Space Agency:
https://www.asc-csa.gc.ca/eng/open-data/default.asp

City of Toronto Data:
https://www.toronto.ca/city-government/data-research-maps/open-data/

Data Hub Collections:
https://datahub.io/collections

UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets.php

NASA Earth Data:
https://earthdata.nasa.gov/

CERN Open Data Portal:
http://opendata.cern.ch/

WHO Global Health Observatory Data Repository:
https://www.who.int/data/gho

British Film Institute:
https://www.bfi.org.uk/industry-data-insights

NYC Taxi Rides:
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

FBI Crime Data Explorer:
https://crime-data-explorer.fr.cloud.gov/

Socrata Open Data:
https://opendata.socrata.com/ 

GitHub Awesome Data – Public Sets:
https://github.com/awesomedata/awesome-public-datasets

GitHub API:
https://docs.github.com/en/rest

Academic Torrents:
https://academictorrents.com/browse.php 

Quandl Financial Data:
https://www.quandl.com/search

Data is Plural:
https://tinyletter.com/data-is-plural

AWS Open Datasets:
https://registry.opendata.aws/

Wikipedia Data:
https://en.wikipedia.org/wiki/Wikipedia:Database_download

data.world:
https://data.world/

The World Bank Data:
https://data.worldbank.org/

Reddit Datasets:
https://www.reddit.com/r/datasets/top/?sort=top&t=all

Twitter Data:
https://developer.twitter.com/en/docs

Weather Underground:
https://www.wunderground.com/login

Drugs:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.8q0s4
