Machine Learning Datasets

The internet is amazing. I used to keep this page updated manually with datasets that I found over the internet. Fast forward to today, TensorFlow Datasets and HuggingFace Datasets are much better places to look for ML data. These 2 places contain large collections of different datasets and code to use them efficiently in your deep learning library of choice, and they can download most of the datasets automatically. My old list is now mostly obsolete.

Collections

Writing a data pipeline with good performance can be time consuming. It is worth checking whether the dataset that you're interested in is already prepared for you in one of these collections. They are large and cover a very wide range of tasks.

TensorFlow Datasets
HuggingFace Datasets

For image tasks, torchvision also has a collection of image datasets available in its API.

torchvision.datasets — Torchvision 0.8.1 documentation
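As a rough sketch of what "already prepared" looks like in practice, the snippet below loads one dataset through each of the three collections. The dataset name "mnist", the split names and the ./data directory are assumptions picked for illustration, so check each catalogue for the exact identifier you need. The imports sit inside each function so that only the library you actually call has to be installed.

```python
def load_with_tfds(name="mnist", split="train"):
    """Load a dataset through TensorFlow Datasets (downloads on first use)."""
    import tensorflow_datasets as tfds

    # as_supervised=True yields (input, label) tuples instead of dicts.
    return tfds.load(name, split=split, as_supervised=True)


def load_with_huggingface(name="mnist", split="train"):
    """Load a dataset through HuggingFace Datasets (downloads on first use)."""
    import datasets

    return datasets.load_dataset(name, split=split)


def load_with_torchvision(root="./data", train=True):
    """Load MNIST through torchvision, storing the files under `root`."""
    from torchvision import datasets as tv_datasets

    return tv_datasets.MNIST(root=root, train=train, download=True)


# Hypothetical lookup table so a script can pick a backend by name.
LOADERS = {
    "tfds": load_with_tfds,
    "huggingface": load_with_huggingface,
    "torchvision": load_with_torchvision,
}
```

Calling LOADERS["torchvision"]() needs network access the first time, since each library fetches and caches the data itself.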

Not in Collections (yet)

  1. DENSE Dataset:
    • Link here
    • Self-driving, adverse weather, multiple sensors
  2. Oxford RobotCar + radar
    • Link here
    • Self-driving, LIDAR, 360 scanning FMCW radar, camera
  3. RaDICaL Dataset:

Archived as of 2020

Types

  1. Images
  2. Audio
    1. Speech
    2. Music
  3. Misc

Image Datasets

MNIST

Probably one of the most well-known (and overused) datasets in machine learning. 28x28 grayscale images of handwritten digits with labels. Popular ML tools like TensorFlow come with wrappers to download it. Originals available here.
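If you do grab the originals instead of using a wrapper, they come in the IDX binary format. Here is a minimal sketch of a reader, assuming the usual IDX layout: two zero bytes, a type code (0x08 for unsigned bytes), the number of dimensions, then one big-endian 32-bit size per dimension followed by the raw data. The function names parse_idx and read_idx are my own, not part of any library.

```python
import struct


def parse_idx(raw: bytes):
    """Parse an IDX byte string into (dims, flat list of unsigned bytes)."""
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", raw[:4])
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: bad magic number")
    if type_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte data")

    # Each dimension size is a 4-byte big-endian unsigned integer.
    header_end = 4 + 4 * ndim
    dims = struct.unpack(">" + "I" * ndim, raw[4:header_end])
    data = raw[header_end:]

    # Sanity check: the payload must hold exactly prod(dims) bytes.
    expected = 1
    for d in dims:
        expected *= d
    if len(data) != expected:
        raise ValueError("payload size does not match header")
    return dims, list(data)


def read_idx(path):
    """Read a (possibly gzip-compressed) IDX file from disk."""
    import gzip

    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        return parse_idx(f.read())
```

For an image file, dims should come back as (number of images, 28, 28); for a label file it is a single number. A flat list is fine for inspection, but for training you would hand the bytes to your array library of choice.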

CIFAR

There are 2 versions, with 10 classes and 100 classes, commonly referred to as CIFAR-10 and CIFAR-100 respectively. Both are labelled subsets of the Tiny Images Dataset.
Each version has 60000 32x32 color images. In CIFAR-10, the classes are mutually exclusive. In CIFAR-100, the 100 classes are grouped into 20 superclasses of 5 sub-classes each. The number of examples per class is equal.
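The numbers above work out as follows. This is only a small arithmetic sketch, assuming the 60000 images apply to each version separately and that they are split evenly across classes; the helper name images_per_class is made up for illustration.

```python
TOTAL_IMAGES = 60_000  # assumed to be the count for each version
IMAGE_SIDE = 32        # images are 32x32 pixels
CHANNELS = 3           # color images: red, green, blue

CIFAR10_CLASSES = 10
CIFAR100_SUPERCLASSES = 20
CIFAR100_SUBCLASSES_PER_SUPERCLASS = 5

# 20 superclasses with 5 sub-classes each give the 100 fine-grained classes.
cifar100_classes = CIFAR100_SUPERCLASSES * CIFAR100_SUBCLASSES_PER_SUPERCLASS


def images_per_class(total_images, num_classes):
    """Return the per-class count for an evenly balanced dataset."""
    if total_images % num_classes != 0:
        raise ValueError("images do not divide evenly across classes")
    return total_images // num_classes


cifar10_per_class = images_per_class(TOTAL_IMAGES, CIFAR10_CLASSES)
cifar100_per_class = images_per_class(TOTAL_IMAGES, cifar100_classes)

# Number of raw values needed to store one image.
values_per_image = IMAGE_SIDE * IMAGE_SIDE * CHANNELS

print(cifar100_classes, cifar10_per_class, cifar100_per_class, values_per_image)
```

So CIFAR-100 has ten times fewer examples per class than CIFAR-10, which is a large part of why it is the harder of the two.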

Tiny Images Dataset

79,302,017 tiny (32x32) images. Not all are natural images; some are graphs or drawn figures. Annotated, but the images are not equally distributed across labels.

ImageNet

Another wildly popular image dataset. Also the dataset used for the ILSVRC.
Probably the largest and most feature-complete image dataset.

  • Hierarchical labels according to WordNet
  • Bounding boxes for objects
  • Account creation required for download.
  • Also available on Kaggle

CelebA

202,599 images of 10,177 public individuals. Original images were obtained from the internet. The Dropbox link is often unusable because of the dataset's popularity. Use the Google Drive or Baidu Drive links that the authors provide at the top of the page instead.

Also available as a Kaggle dataset (cropped and aligned only)
Kaggle API: kaggle datasets download -d jessicali9530/celeba-dataset
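If you want to run that command from a script, a thin wrapper like the one below should do. It is a sketch that assumes the kaggle command-line tool is installed and an API token is already configured. The -p (destination path) and --unzip flags are how I understand the CLI, so confirm them with kaggle datasets download -h. The function names are hypothetical.

```python
import subprocess


def kaggle_download_command(dataset, dest="."):
    """Build the argument list for downloading a Kaggle dataset."""
    return [
        "kaggle", "datasets", "download",
        "-d", dataset,  # owner/dataset-name slug
        "-p", dest,     # assumed flag: directory to download into
        "--unzip",      # assumed flag: extract the archive after download
    ]


def download_kaggle_dataset(dataset, dest="."):
    """Run the download; raises CalledProcessError if the CLI fails."""
    subprocess.run(kaggle_download_command(dataset, dest), check=True)


if __name__ == "__main__":
    # Prints the command instead of running it, so this is safe to try offline.
    print(" ".join(kaggle_download_command("jessicali9530/celeba-dataset", "data")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if the destination path contains spaces.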

A high resolution variant (1024x1024) is used in the Progressive GAN paper. Obtaining it is somewhat involved. This repo contains scripts to help download and create the high resolution images.

Audio Datasets

Speech

VCTK Corpus

High quality recordings of 109 native English speakers with different accents.
48kHz Version
96kHz Version

Common Voice Dataset

An ongoing project by Mozilla. English is mostly usable at the time of writing. Several other languages have launched but are not yet fully validated.

Music

Beethoven Sonatas

Not really a dataset, but some recordings of Beethoven's 32 sonatas found in the public domain (https://archive.org). About 10 hours of non-vocal piano music.
First used (that I know of) in Mehri et al., SampleRNN (ICLR 2017).
A script to download these audio files is available here [github]

Million Song Dataset

Audio clips are not included in the archives.
Contains audio features and tags. A Python script to download preview clips is available, but there are limitations.

Miscellaneous

Not really datasets, but websites with (free) data to download that are definitely worth a look.

  1. Open Science Data Cloud
  2. UCI Machine Learning Repository
  3. Kaggle