A collection of free datasets for computer vision projects. Part 1.
March 16, 2022
While working on computer vision projects, you have probably faced the need for datasets to use in your data science or machine learning experiments, from data labeling to AI model training.
We put together a useful list of computer vision datasets that you can download for free.
1. Open Images
First released in 2017, this dataset contains ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives.
- 15,851,536 boxes on 600 categories
- 2,785,498 instance segmentations on 350 categories
- 3,284,280 relationship annotations on 1,466 relationships
- 675,155 localized narratives
- 59,919,574 image-level labels on 19,957 categories
- Extension - 478,000 crowdsourced images with 6,000+ categories
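Open Images distributes its box annotations as CSV files with coordinates normalized to [0, 1]. The sketch below, using an invented sample row and a hypothetical image size, shows how one row converts to pixel coordinates (the column layout follows the public box files: ImageID, Source, LabelName, Confidence, XMin, XMax, YMin, YMax):

```python
import csv, io

# A made-up row in the Open Images box-annotation CSV layout:
# ImageID, Source, LabelName, Confidence, XMin, XMax, YMin, YMax
sample = "000002b66c9c498e,xclick,/m/01g317,1,0.25,0.75,0.10,0.90\n"

reader = csv.reader(io.StringIO(sample))
image_w, image_h = 1024, 768  # hypothetical image size

for image_id, source, label, conf, xmin, xmax, ymin, ymax in reader:
    # Open Images stores coordinates normalized to [0, 1];
    # scale them by the image dimensions to get pixels.
    box_px = (round(float(xmin) * image_w), round(float(ymin) * image_h),
              round(float(xmax) * image_w), round(float(ymax) * image_h))
    print(image_id, label, box_px)  # (left, top, right, bottom) in pixels
```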
2. Kinetics
A collection of large-scale, high-quality datasets of URL links to ~650,000 video clips. Depending on the version (Kinetics-400, Kinetics-600, or Kinetics-700), the dataset covers 400, 600, or 700 human action classes. The videos include 'human-object' as well as 'human-human' interactions. Each clip is human-annotated with a single action class and lasts ~10 seconds. These datasets can be used for training and exploring neural network architectures that model human actions in video.
[Table: statistics on the number of video clips per class for the different Kinetics datasets (Dataset, # Clips, # Classes, Average, Minimum).]
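Kinetics does not ship the videos themselves; it ships CSV annotations listing a YouTube ID plus start and end times for each clip. A minimal parsing sketch (the row layout mirrors the public annotation files, but the ID and values here are invented):

```python
import csv, io

# Kinetics annotations are CSVs of YouTube IDs plus clip start/end times;
# the values in this sample row are invented for illustration.
sample = ("label,youtube_id,time_start,time_end,split\n"
          "riding a bike,dQw4w9WgXcQ,12,22,train\n")

rows = list(csv.DictReader(io.StringIO(sample)))
clip = rows[0]
duration = int(clip["time_end"]) - int(clip["time_start"])
print(clip["label"], duration)  # each clip is ~10 seconds long
```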
3. ImageNet
This image database was created to benchmark object recognition and contains more than 14 million images. It is organized according to the WordNet hierarchy developed at Princeton University, with each node of the hierarchy depicted by hundreds or thousands of images. The project has been instrumental in advancing computer vision and deep learning research.
4. MS COCO
Microsoft COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset. This dataset was created with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization.
The dataset contains:
- Object segmentation
- Recognition in context
- Superpixel stuff segmentation
- 330K images with more than 200K labeled
- 1.5 million object instances
- 80 object categories
- 91 stuff categories
- 5 captions per image
- 250,000 people with keypoints
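COCO ships its labels as a single JSON file whose "images", "categories", and "annotations" sections are linked by numeric IDs, with per-instance polygons and [x, y, width, height] boxes. A hand-built miniature of that structure (all IDs and values here are invented):

```python
# A miniature, hand-built annotation dict in the COCO layout:
# "images", "categories", and "annotations" are linked by numeric IDs.
coco = {
    "images": [{"id": 1, "file_name": "000000001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 18, "name": "dog", "supercategory": "animal"}],
    "annotations": [{
        "id": 101, "image_id": 1, "category_id": 18,
        "bbox": [120.0, 80.0, 200.0, 150.0],   # [x, y, width, height]
        "area": 30000.0, "iscrowd": 0,
        "segmentation": [[120, 80, 320, 80, 320, 230, 120, 230]],  # polygon
    }],
}

# Resolve category IDs and convert boxes to (left, top, right, bottom).
cats = {c["id"]: c["name"] for c in coco["categories"]}
for ann in coco["annotations"]:
    x, y, w, h = ann["bbox"]
    print(cats[ann["category_id"]], (x, y, x + w, y + h))
```

In practice, the `pycocotools` package handles this indexing for you when you load a real annotation file such as `instances_val2017.json`.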
5. BDD100K
Originally called Berkeley DeepDrive, BDD100K is a diverse driving dataset for heterogeneous multitask learning. It is the largest driving video dataset, with 100K videos and 10 tasks for evaluating the progress of image recognition algorithms on autonomous driving. BDD100K includes a diverse set of driving videos under various weather conditions, times of day, and scene types. The dataset also comes with a rich set of annotations: scene tagging, object bounding boxes, lane marking, drivable area, full-frame semantic and instance segmentation, multiple object tracking, and multiple object tracking with segmentation.
6. Mapillary Vistas
Mapillary Vistas is a diverse street-level imagery dataset with pixel-accurate and instance-specific human annotations for understanding street scenes around the world.
It contains images from all around the world, captured under varying weather, season, and daytime conditions. Images come from different imaging devices (mobile phones, tablets, action cameras, professional capture rigs) and from photographers with different levels of experience.
This dataset includes:
- 25,000 high-resolution images
- 124 semantic object categories
- 100 instance-specifically annotated categories
- Global reach, covering 6 continents
- Variety of weather, season, time of day, camera, and viewpoint
7. YouTube-8M
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs. It has high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. Introduced in 2016, YouTube-8M originally had 8.2M videos (with 4,800 classes) and got its name from this number. Over several releases the annotations became higher-quality and more topical, and the annotation vocabulary went through a clean-up process. A number of low-frequency or low-quality labels and their associated videos were removed, resulting in a smaller but higher-quality dataset (6.1M videos, 3,862 classes).
YouTube-8M Segments is the latest addition and carries segment-level annotations: human-verified labels on about 237K segments across 1,000 classes, collected from the validation set of the YouTube-8M dataset. Each video comes with time-localized frame-level features, so classifier predictions can be made at segment-level granularity.
8. CIFAR-10 and CIFAR-100
CIFAR-10 and CIFAR-100 are popular computer vision datasets created as labeled subsets of the now-deprecated "Tiny Images" dataset (80 million images). CIFAR-10, often used for object recognition, consists of 60,000 32×32 color images divided into 10 classes (6,000 images per class). It is split into five training batches and one test batch of 10,000 images each, giving 50,000 training images and 10,000 test images.
CIFAR-100 consists of 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
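Each CIFAR batch is a pickled dict whose data entry stores one image per row as 3,072 bytes: 1,024 red values, then 1,024 green, then 1,024 blue, with each channel laid out row-major as a 32×32 grid. The toy row below shows the index arithmetic for recovering a pixel without needing the real files (the pixel position and color values are invented):

```python
# One CIFAR image row: 3 channels * 1024 bytes, channel-major layout.
row = bytearray(3072)
x, y = 5, 7                                  # hypothetical pixel of interest
for c, value in enumerate((200, 120, 40)):   # R, G, B channel values
    row[c * 1024 + y * 32 + x] = value       # channel offset + row-major index

def pixel(row, x, y):
    """Return the (R, G, B) triple stored for pixel (x, y)."""
    return tuple(row[c * 1024 + y * 32 + x] for c in range(3))

print(pixel(row, 5, 7))  # (200, 120, 40)
```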
9. MPII Human Pose
This dataset serves as a benchmark for evaluating articulated human pose estimation. It includes around 25K images containing over 40K people with annotated body joints. MPII Human Pose covers 410 human activities, and each image is provided with an activity label. Each image was extracted from a YouTube video and is provided with preceding and following unannotated frames.
10. LabelMe
The LabelMe database is a large collection of images with ground-truth labels for object detection and recognition, used mainly in computer vision research. LabelMe has 187,240 images, 62,197 annotated images, and 658,992 labeled objects.
11. LabelMe-12-50k
The LabelMe-12-50k dataset consists of 50,000 images (40,000 for training and 10,000 for testing) in JPEG format, divided into 12 classes, with each image being 256×256 pixels in size. The images were extracted from LabelMe.
12. Cityscapes
This is a large-scale dataset containing a diverse set of stereo video sequences recorded in street scenes from 50 different cities, across several seasons and various weather conditions. Focused on semantic understanding of urban street scenes, it has high-quality, pixel-level annotations for 5,000 frames, in addition to a larger set of 20,000 weakly annotated frames.
13. CelebA
A large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations, covering large pose variations and background clutter. It offers large diversity, large quantity, and rich annotations, including 10K identities and 202K face images. It also has additional subsets: CelebAMask-HQ, CelebA-Spoof, and CelebA-Dialog. All of these can serve as training and test sets for face attribute recognition, face recognition, face detection, landmark (or facial part) localization, and face editing and synthesis tasks.
14. IMDB-WIKI
The IMDB-WIKI dataset is one of the largest open-source datasets of face images with gender and age labels for training. It contains a total of 523,051 face images: 460,723 obtained from 20,284 celebrities listed on IMDb and 62,328 from Wikipedia.
15. Fashion-MNIST
Fashion-MNIST is a computer vision image dataset consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image associated with a label from 10 classes. An automatic benchmarking system based on scikit-learn covers 129 classifiers with different parameters.
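Fashion-MNIST deliberately reuses MNIST's IDX file format, so any MNIST loader works unchanged: a big-endian header of a magic number followed by one 32-bit size per dimension, then the raw pixel bytes. This sketch builds a tiny fake images-file header (the counts are invented) and parses it back:

```python
import struct

# IDX header: big-endian magic number, then one uint32 per dimension.
# Magic 2051 marks an images file (2049 marks a labels file).
header = struct.pack(">IIII", 2051, 10, 28, 28)  # 10 fake 28x28 images

magic, n_images, n_rows, n_cols = struct.unpack(">IIII", header)
assert magic == 2051              # confirm this is an images file
print(n_images, n_rows, n_cols)   # 28x28 grayscale, as in Fashion-MNIST
```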
16. LSUN
A large-scale image dataset built using deep learning with humans in the loop. LSUN contains around one million labeled images for each of 10 scene categories and 20 object categories.
17. Visual Genome
The Visual Genome dataset consists of seven main components: region descriptions, objects, attributes, relationships, region graphs, scene graphs, and question-answer pairs. It has a total of 108,077 images with 3.8 million objects. Each image contains an average of 35 objects, each delineated by a tight bounding box, and an average of 26 attributes.
Visual Genome numbers:
- 108,077 Images
- 5.4 Million Region Descriptions
- 1.7 Million Visual Question Answers
- 3.8 Million Object Instances
- 2.8 Million Attributes
- 2.3 Million Relationships
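Visual Genome's scene graphs link objects, attributes, and relationships through object IDs. A hand-built miniature of that structure (the IDs, names, and relationship here are invented for illustration):

```python
# A toy scene graph in the Visual Genome style: objects are keyed by ID,
# and attributes/relationships refer back to those IDs.
objects = {1: "man", 2: "bicycle"}
attributes = {1: ["tall"], 2: ["red"]}
relationships = [{"subject": 1, "predicate": "riding", "object": 2}]

for rel in relationships:
    subj, obj = objects[rel["subject"]], objects[rel["object"]]
    print(f"{subj} -{rel['predicate']}-> {obj}")  # man -riding-> bicycle
```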
18. MIT Places
A scene-centric dataset called Places, with 205 scene categories and 2.5 million images with a category label.
19. COIL-100
Columbia Object Image Library (COIL-100) contains 7,200 images of 100 objects. Each object was rotated on a turntable through 360 degrees to vary its pose with respect to a fixed color camera, with images taken at pose intervals of 5 degrees, yielding 72 poses per object. The images were then size-normalized. The objects have a wide variety of complex geometric and reflectance characteristics.
20. Stanford Dogs
The Stanford Dogs dataset contains images of 120 breeds of dogs from around the world. This dataset was built using images and annotations from ImageNet for the task of fine-grained image categorization.
Contents of this dataset:
- Number of categories: 120
- Number of images: 20,580
- Annotations: Class labels, Bounding boxes
21. xView
This is one of the largest publicly available datasets of overhead imagery. xView contains images from complex scenes around the world, annotated using bounding boxes.
- 1.0M Object Instances
- 60 Classes
- 0.3 meter Resolution
- 1,415 km² of imagery
22. Visual Question Answering (VQA)
Visual Question Answering (VQA) is a dataset containing open-ended questions about images. Answering these questions requires an understanding of vision, language, and commonsense knowledge.
- 265,016 images (COCO and abstract scenes)
- At least 3 questions (5.4 questions on average) per image
- 10 ground truth answers per question
- 3 plausible (but likely incorrect) answers per question
- Automatic evaluation metric
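The automatic evaluation metric compares a predicted answer against the 10 human ground-truth answers: an answer counts as fully correct once at least 3 annotators gave it. The sketch below implements the commonly used simplified form, min(matches / 3, 1); the official evaluation additionally averages this over all subsets of 9 annotators. The sample answers are invented:

```python
# Simplified VQA accuracy: an answer is fully correct once at least 3 of
# the 10 human annotators gave it; fewer matches earn partial credit.
def vqa_accuracy(predicted, human_answers):
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

humans = ["blue"] * 6 + ["light blue"] * 3 + ["navy"]  # invented sample
print(vqa_accuracy("blue", humans))        # 1.0 (6 matches, capped at 1)
print(vqa_accuracy("light blue", humans))  # 1.0 (exactly 3 matches)
print(vqa_accuracy("navy", humans))        # 0.33... (1 match)
```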
Build your own dataset for computer vision
These datasets can be a valuable jumping-off point for your own computer vision projects. While public datasets are often used for training and tuning AI models, they can also be a valuable source of data for more specialized projects. You may, for example, be interested in the same objects but need to add different tags or other metadata. Or you may need to label specific features of objects contained in the dataset. Or you may be interested in objects that are likely to appear in the same images or videos but were not labeled in the public dataset.
As an example, we recently used a dataset with images of grocery products, but instead of the pre-labeled packaging images, we were interested in identifying empty slots on the shelves, a category that did not exist in the dataset. After importing the dataset to Lodestar, we labeled the empty columns and trained an AI model that could detect when a product was out of stock.
In each of these cases, importing the dataset into Lodestar, refining it to meet your needs, and then filtering and exporting it can help you overcome the daunting task of collecting images and video for your project. Just be sure to check the licensing terms to confirm you have permission to use the dataset for your purposes.