Working in a large team of over 60 data architects, data scientists, data analysts, and application developers, we see, touch and feel data all the time. We must augment the data that we have, and we need a library of good sources of new data. 

Data is everywhere. It is like water. But, just like water, if it is not moving it goes stale. If it is an isolated set, it is limited in its purpose – just as a glass of water quenches thirst but might not put out a fire. But, if data is the new water and everyone needs it, where is it? While it’s extremely important to know where data is, it’s as important – if not more important – to know where other data can be found. If you’ve been working in data analytics or data science, you likely have a favorite go-to source for data, whether it’s Public Datasets, Open Data, FreeData, or something else.  

If you are unsure what you would do with new data, or you are already drowning in your own, here are some examples of how we use this open world to go beyond what we can do with our clients’ data on its own:

  1. Train models – We use open data to train models when we need volume, when we need to extend the available data, or when the real data is simply not available while we are building AI or analytic models. This form of rapid prototyping can get a key concept across to the client more easily than a thousand PowerPoints: visualized data that simulates real data sets, modelled in a tool of choice and presented well, is a powerful influencer for ideas. Plus, we get the feedback we need to build better models, better solutions, and better outcomes.

  2. Blending our data with open data – On its own, a single source of data is limited. Adding other data sets that can be matched to the existing data adds new understanding and provides new capabilities.

  3. Validating data – Many (dare we say all) systems contain poor-quality data. Sometimes, we can align our data with reference data sets to improve its quality. The easy use case here is address data. This will not necessarily match the person to the address, but it will verify whether the address is valid (e.g. the city is in the right state).

  4. Mastering data – Take the above scenario where we let bad data in and have to fix it later. Mastering data turns this around and uses reference data to validate new input as we capture it (e.g. CA could be used as a state – California – in some instances, but also as a country code – Canada – in others).

  5. Building out a comprehensive view of a specific area – This analyzes our product relative to others in the market using comparative data and substitution engine data. It provides a 360-degree view of customers and their buying journeys.

  6. Finding new insights – Blending multiple data sets will allow for some unique quantitative analysis. Without all the data, the pattern does not emerge.  
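
To make the blending, validating, and mastering ideas above a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the tiny lookup tables stand in for an open data set you would actually download, and the names (`REFERENCE_CITIES`, `OPEN_DATA_BY_REGION`, `AMBIGUOUS_CODES`, `is_valid_address`, `blend`, `resolve_code`) are hypothetical, not part of any library or of our real tooling.

```python
# Hypothetical reference set standing in for an open address data set:
# valid (city, state/province code) pairs.
REFERENCE_CITIES = {
    ("toronto", "ON"),
    ("ottawa", "ON"),
    ("sacramento", "CA"),
}

# Hypothetical open data keyed by state/province code, used for blending.
OPEN_DATA_BY_REGION = {
    "ON": {"region_name": "Ontario", "country": "Canada"},
    "CA": {"region_name": "California", "country": "United States"},
}

# Hypothetical master list of codes whose meaning depends on the field.
AMBIGUOUS_CODES = {"CA": {"state": "California", "country": "Canada"}}


def is_valid_address(record):
    """Validate: True if the (city, region) pair appears in the reference set."""
    key = (record["city"].strip().lower(), record["region"].strip().upper())
    return key in REFERENCE_CITIES


def blend(record):
    """Blend: return a copy of the record enriched with open data for its region."""
    extra = OPEN_DATA_BY_REGION.get(record["region"].strip().upper(), {})
    return {**record, **extra}


def resolve_code(code, field_name):
    """Master: interpret a code based on the field it was captured in."""
    return AMBIGUOUS_CODES.get(code.upper(), {}).get(field_name)


clients = [
    {"name": "A", "city": "Toronto", "region": "on"},
    {"name": "B", "city": "Sacramento", "region": "ON"},  # city in the wrong region
]

# Keep only records that pass validation, then enrich them.
valid = [r for r in clients if is_valid_address(r)]
blended = [blend(r) for r in valid]
```

In practice the lookup tables would be loaded from a downloaded file or an API rather than typed in, and the matching would need to cope with misspellings and alternate names, but the shape of the work is the same: match on a shared key, then check or enrich.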

Below is a list of some of the datasets available. As you can see, there is an ocean of data that is free, gated, requires a fee/donation, or requires a paid subscription. I do not endorse, and am not paid or sponsored by, any of the open data sets listed. I also make no representation about the quality of the sites, the organizations, or the quality and accuracy of the data sets outlined.

Hopefully, knowing where to get some of this data and some of the use cases will help you. You still need to know how to use the data and identify the business outcomes that you are trying to understand and change, but hopefully this blog gives you some ideas, expands the oceans to explore, and has you setting off on new data adventures. 

If your favorite open dataset is not listed – send back comments. If you find this helpful and do something cool with this new and expanded power – let me know. If you are new to this world and want to sit down and discuss how your data can come alive by extending it – please connect. 

As you embark into this sea of available open data sets – Bon Voyage!  

Google Dataset Search:
https://datasetsearch.research.google.com/

Google BigQuery Public Datasets:
https://cloud.google.com/bigquery/public-data/

Kaggle:
https://www.kaggle.com/datasets

US Government Data: 
https://www.data.gov/

Canadian Government Data:
https://open.canada.ca/en/open-data

Canadian GeoSpatial Data: 
https://canadiangis.com/data.php

Statistics Canada Open Data:
https://www.statcan.gc.ca/eng/microdata/data-centres/data

Ontario Data:
https://data.ontario.ca/

Canadian Space Agency:
https://www.asc-csa.gc.ca/eng/open-data/default.asp

City of Toronto Data:
https://www.toronto.ca/city-government/data-research-maps/open-data/

Data Hub Collections:
https://datahub.io/collections

UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets.php

NASA Earth Data:
https://earthdata.nasa.gov/

CERN Open Data Portal:
http://opendata.cern.ch/

WHO Global Health Observatory Data Repository:
https://www.who.int/data/gho

British Film Institute:
https://www.bfi.org.uk/industry-data-insights

NYC Taxi Rides:
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

FBI Crime Data Explorer:
https://crime-data-explorer.fr.cloud.gov/

Socrata Open Data:
https://opendata.socrata.com/ 

GitHub Awesome Public Datasets:
https://github.com/awesomedata/awesome-public-datasets

GitHub API:
https://docs.github.com/en/rest

Academic Torrents:
https://academictorrents.com/browse.php 

Quandl Financial Data:
https://www.quandl.com/search

Data is Plural:
https://tinyletter.com/data-is-plural

AWS Open Datasets:
https://registry.opendata.aws/

Wikipedia Data:
https://en.wikipedia.org/wiki/Wikipedia:Database_download

data.world:
https://data.world/

The World Bank Data:
https://data.worldbank.org/

Reddit Datasets:
https://www.reddit.com/r/datasets/top/?sort=top&t=all

Twitter Data:
https://developer.twitter.com/en/docs

Weather Underground:
https://www.wunderground.com/login

Drugs:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.8q0s4