The internet is amazing. I used to keep this page updated manually with datasets that I found on the internet. Fast forward to today, TensorFlow Datasets and HuggingFace Datasets are much better places to look for ML data. These two collections provide a large number of datasets, code to use them efficiently in your deep learning library of choice, and automatic download for most of them. My old list is now mostly obsolete.
Collections
Writing a data pipeline with good performance can be time-consuming. Before you start, check whether the dataset you're interested in is already prepared for you in one of these collections. They are huge and cover a very wide range of tasks.
For image tasks, torchvision also has a collection of image datasets available in its API.
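As a rough sketch of what "already prepared for you" looks like in practice, the snippet below wraps one loading call per collection. MNIST is used purely as an example, the function names and the `LOADERS` table are my own illustration, and each library is imported lazily so you only need the one you actually use installed. The dataset identifier on the HuggingFace hub is an assumption and may differ from `"mnist"`.

```python
from typing import Callable, Dict


def load_with_tfds(name: str = "mnist"):
    """Load the training split through TensorFlow Datasets."""
    import tensorflow_datasets as tfds  # lazy import: pip install tensorflow-datasets

    # as_supervised=True yields (image, label) pairs instead of dicts.
    return tfds.load(name, split="train", as_supervised=True)


def load_with_huggingface(name: str = "mnist"):
    """Load the training split through HuggingFace Datasets."""
    import datasets  # lazy import: pip install datasets

    # The hub identifier is an assumption; check the hub for the exact name.
    return datasets.load_dataset(name, split="train")


def load_with_torchvision(root: str = "./data"):
    """Load the MNIST training set through torchvision."""
    from torchvision import datasets as tv_datasets  # lazy import: pip install torchvision

    # download=True fetches the files into `root` on first use.
    return tv_datasets.MNIST(root=root, train=True, download=True)


# Hypothetical lookup table so calling code can pick a backend by name.
LOADERS: Dict[str, Callable] = {
    "tfds": load_with_tfds,
    "huggingface": load_with_huggingface,
    "torchvision": load_with_torchvision,
}
```

Calling `LOADERS["tfds"]()` would then download and prepare the data on first use, which is exactly the part you no longer have to write yourself.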
Not in Collections (yet)
- DENSE Dataset:
  - Link here
  - Self-driving, adverse weather, multiple sensors
- Oxford robot car + radar
  - Link here
  - Self-driving, LIDAR, 360 scanning FMCW radar, camera
- RaDICaL Dataset:
  - This is my own work
  - Links: Paper, Dataset DOI, Project
  - RGB-D, Raw radar, IMU
Archived as of 2020
Types
Image Datasets
MNIST
Probably the most well-known (and overused) dataset in machine learning. 28x28 grayscale images of handwritten digits with labels. Popular ML tools like TensorFlow come with wrappers to download it. Originals available here.
CIFAR
Two versions with 10 and 100 classes, commonly referred to as CIFAR-10 and CIFAR-100 respectively. Both are labelled subsets of the Tiny Images Dataset.
Each version has 60,000 32x32 color images of things of various classes. In CIFAR-10, the classes are mutually exclusive. In CIFAR-100, there are 20 superclasses with 5 sub-classes each. The number of examples per class is equal.
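Since the classes are balanced, the per-class counts follow directly from the totals. A small sketch of the arithmetic (the helper name is my own, not part of any CIFAR tooling):

```python
TOTAL_IMAGES = 60_000  # images in each of CIFAR-10 and CIFAR-100


def images_per_class(total: int, num_classes: int) -> int:
    """Return the per-class count for a perfectly balanced dataset."""
    if total % num_classes != 0:
        raise ValueError("total is not evenly divisible by num_classes")
    return total // num_classes


cifar10_per_class = images_per_class(TOTAL_IMAGES, 10)    # 6000
cifar100_per_class = images_per_class(TOTAL_IMAGES, 100)  # 600

# CIFAR-100: 20 superclasses with 5 sub-classes each gives the 100 fine labels.
fine_classes = 20 * 5
images_per_superclass = 5 * cifar100_per_class  # 3000

print(cifar10_per_class, cifar100_per_class, fine_classes, images_per_superclass)
```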
Tiny Images Dataset
79,302,017 tiny (32x32) images. Not all are natural images; some are graphs or drawn figures. Annotated, but the labels are not equally distributed.
ImageNet
Another wildly popular image dataset. Also the dataset used for the ILSVRC.
Probably the largest and most feature-complete image dataset.
- Hierarchical labels according to WordNet
- Bounding boxes for objects
- Account creation required for download.
- Also available on Kaggle
CelebA
202,599 images of 10,177 public individuals. Original images were obtained from the internet. The Dropbox link is often unusable because of the dataset's popularity. Use the Google Drive or Baidu Drive links that the authors provide at the top of the page instead.
Also available as a Kaggle dataset (cropped and aligned only)
Kaggle API: kaggle datasets download -d jessicali9530/celeba-dataset
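If you would rather trigger that download from a script, a minimal sketch is to build the same command as an argument list and hand it to `subprocess`. The helper name is my own. The `-p` (destination path) and `--unzip` options are ones I believe the Kaggle CLI supports, so treat them as assumptions and check `kaggle datasets download -h`. Running the command needs the `kaggle` CLI installed and API credentials configured, so the actual call is left commented out.

```python
import shlex
from typing import List, Optional


def kaggle_download_command(
    dataset: str, dest: Optional[str] = None, unzip: bool = False
) -> List[str]:
    """Build the argument list for `kaggle datasets download`."""
    cmd = ["kaggle", "datasets", "download", "-d", dataset]
    if dest is not None:
        cmd += ["-p", dest]  # assumed flag: destination folder
    if unzip:
        cmd.append("--unzip")  # assumed flag: extract after download
    return cmd


cmd = kaggle_download_command("jessicali9530/celeba-dataset")
print(shlex.join(cmd))

# To actually run it (requires the kaggle CLI and credentials):
# import subprocess
# subprocess.run(cmd, check=True)
```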
A high-resolution variant (1024x1024) is used in the Progressive GAN paper. Obtaining it is somewhat involved. This repo contains scripts to help download and create the high-resolution images.
Audio Datasets
Speech
VCTK Corpus
High quality recordings of 109 native English speakers with different accents.
48kHz Version
96kHz Version
Common Voice Dataset
An ongoing project by Mozilla. English is mostly usable at the time of writing. Multiple other languages have launched but are not yet fully validated.
Music
Beethoven Sonatas
Not really a dataset, but some recordings of Beethoven's 32 sonatas found in the public domain (https://archive.org). About 10 hours of non-vocal piano music.
First used (that I know of) in Mehri et al., SampleRNN (ICLR 2017).
A script to download these audio files is available here [github]
Million Song Dataset
Audio clips not included in the archives
Contains audio features and tags. A Python script to download preview clips is available, but there are limitations.
Miscellaneous
Not really datasets, but websites with (free) data to download and definitely worth a look.