List of datasets in computer vision and image processing

This list covers datasets for computer vision and image processing research and is part of the broader list of datasets for machine-learning research. These datasets consist primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
MNIST	Database of grayscale handwritten digits.		60,000	image, label	classification	1994	^[1]	LeCun et al.
Extended MNIST	Database of grayscale handwritten digits and letters.		810,000	image, label	classification	2010	^[2]	NIST
NYU Object Recognition Benchmark (NORB)	Stereoscopic pairs of photos of toys in various orientations.	Centering, perturbation.	97,200 image pairs (50 uniform-colored toys under 36 angles, 9 azimuths, and 6 lighting conditions)	Images	Object recognition	2004	^[3]^[4]	LeCun et al.
80 Million Tiny Images	80 million 32×32 images labelled with 75,062 non-abstract nouns.		80,000,000	image, label		2008	^[5]	Torralba et al.
Street View House Numbers (SVHN)	630,420 digits with bounding boxes in house numbers captured in Google Street View.		630,420	image, label, bounding boxes		2011	^[6]^[7]	Netzer et al.
JFT-300M	Dataset internal to Google Research. 303M images with 375M labels in 18,291 categories.		303,000,000	image, label		2017	^[8]^[9]^[10]	Google Research
JFT-3B	Internal to Google Research. 3 billion images, annotated with ~30k categories in a hierarchy.		3,000,000,000	image, label		2021	^[11]	Google Research
Places	10+ million images in 400+ scene classes, with 5,000 to 30,000 images per class.		10,000,000	image, label		2018	^[12]	Zhou et al.
Ego4D	A massive-scale egocentric dataset and benchmark suite collected at 74 locations in 9 countries, with over 3,670 hours of daily-life activity video.	Object bounding boxes, transcriptions, labeling.	3,670 video hours	video, audio, transcriptions	Multimodal first-person task	2022	^[13]	K. Grauman et al.
Wikipedia-based Image Text Dataset	37.5 million image-text examples with 11.5 million unique images across 108 Wikipedia languages.		11,500,000	image, caption	Pretraining, image captioning	2021	^[14]	Srinivasan et al., Google Research
Visual Genome	Images and their descriptions		108,000	images, text	Image captioning	2016	^[15]	R. Krishna et al.
Berkeley 3-D Object Dataset	849 images taken in 75 different scenes. About 50 different object classes are labeled.	Object bounding boxes and labeling.	849	labeled images, text	Object recognition	2014	^[16]^[17]	A. Janoch et al.
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500)	500 natural images, explicitly separated into disjoint train, validation, and test subsets, plus benchmarking code. Based on BSDS300.	Each image segmented by five different subjects on average.	500	Segmented images	Contour detection and hierarchical image segmentation	2011	^[18]	University of California, Berkeley
Microsoft Common Objects in Context (MS COCO)	Complex everyday scenes of common objects in their natural context.	Object highlighting, labeling, and classification into 91 object types.	2,500,000	Labeled images, text	Object recognition, image segmentation, keypoint detection, image captioning	2015	^[19]^[20]^[21]	T. Lin et al.
ImageNet	Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge	Labeled objects, bounding boxes, descriptive words, SIFT features	14,197,122	Images, text	Object recognition, scene recognition	2009 (2014)	^[22]^[23]^[24]	J. Deng et al.
SUN (Scene UNderstanding)	Very large scene and object recognition database.	Places and objects are labeled. Objects are segmented.	131,067	Images, text	Object recognition, scene recognition	2014	^[25]^[26]	J. Xiao et al.
LSUN (Large SUN)	10 scene categories (bedroom, etc.) and 20 object categories (airplane, etc.)	Images and labels.	~60 million	Images, text	Object recognition, scene recognition	2015	^[27]^[28]^[29]	Yu et al.
LVIS (Large Vocabulary Instance Segmentation)	Segmentation masks for over 1,000 entry-level object categories in images.		2.2 million segmentations, 164K images	Images, segmentation masks	Instance segmentation	2019	^[30]
Open Images	A large set of images listed as having a CC BY 2.0 license, with image-level labels and bounding boxes spanning thousands of classes.	Image-level labels, bounding boxes	9,178,275	Images, text	Classification, object recognition	2017 (V7: 2022)	^[31]
TV News Channel Commercial Detection Dataset	TV commercials and news broadcasts.	Audio and video features extracted from still images.	129,685	Text	Clustering, classification	2015	^[32]^[33]	P. Guha et al.
Statlog (Image Segmentation) Dataset	The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel.	Many features calculated.	2310	Text	Classification	1990	^[34]	University of Massachusetts
Caltech 101	Pictures of objects.	Detailed object outlines marked.	9146	Images	Classification, object recognition	2003	^[35]^[36]	F. Li et al.
Caltech-256	Large dataset of images for object classification.	Images categorized and hand-sorted.	30,607	Images, Text	Classification, object detection	2007	^[37]^[38]	G. Griffin et al.
COYO-700M	Image–text-pair dataset	10 billion pairs of alt-text and image sources in HTML documents in CommonCrawl	746,972,269	Images, Text	Classification, Image-Language	2022	^[39]
SIFT10M Dataset	SIFT features of Caltech-256 dataset.	Extensive SIFT feature extraction.	11,164,866	Text	Classification, object detection	2016	^[40]	X. Fu et al.
LabelMe	Annotated pictures of scenes.	Objects outlined.	187,240	Images, text	Classification, object detection	2005	^[41]	MIT Computer Science and Artificial Intelligence Laboratory
PASCAL VOC Dataset	Images in 20 categories and localization bounding boxes.	Labeling, bounding box included	500,000	Images, text	Classification, object detection	2010	^[42]^[43]	M. Everingham et al.
CIFAR-10 Dataset	Small, low-resolution (32×32) images of 10 classes of objects.	Classes labelled, training set splits created.	60,000	Images	Classification	2009	^[23]^[44]	A. Krizhevsky et al.
CIFAR-100 Dataset	Like CIFAR-10 above, but with 100 classes of objects.	Classes labelled, training set splits created.	60,000	Images	Classification	2009	^[23]^[44]	A. Krizhevsky et al.
CINIC-10 Dataset	A combination of CIFAR-10 and ImageNet images in 10 classes, with 3 splits. Larger than CIFAR-10.	Classes labelled; training, validation, and test set splits created.	270,000	Images	Classification	2018	^[45]	Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, Amos J. Storkey
Fashion-MNIST	An MNIST-like database of fashion product images.	Classes labelled, training set splits created.	60,000	Images	Classification	2017	^[46]	Zalando SE
notMNIST	Glyphs extracted from publicly available fonts to form a dataset similar to MNIST. There are 10 classes, with letters A–J taken from different fonts.	Classes labelled, training set splits created.	500,000	Images	Classification	2011	^[47]	Yaroslav Bulatov
Linnaeus 5 dataset	Images of 5 classes of objects.	Classes labelled, training set splits created.	8000	Images	Classification	2017	^[48]	Chaladze & Kalatozishvili
11K Hands	11,076 hand images (1600 × 1200 pixels) of 190 subjects aged 18 to 75, for gender recognition and biometric identification.	None	11,076 hand images	Images and (.mat, .txt, and .csv) label files	Gender recognition and biometric identification	2017	^[49]	M. Afifi
CORe50	Designed specifically for continuous/lifelong learning and object recognition; a collection of more than 500 videos (30 fps) of 50 domestic objects belonging to 10 categories.	Classes labelled; training set splits created based on a 3-way, multi-run benchmark.	164,866 RGB-D images	images (.png or .pkl) and (.pkl, .txt, .tsv) label files	Classification, object recognition	2017	^[50]	V. Lomonaco and D. Maltoni
OpenLORIS-Object	A lifelong/continual robotic vision dataset collected by real robots equipped with multiple high-resolution sensors. It includes 121 object instances (first version of the dataset: 40 categories of daily necessities in 20 scenes). The dataset considers 4 environmental factors across scenes (illumination, occlusion, object pixel size, and clutter) and explicitly defines difficulty levels for each factor.	Classes labelled; training/validation/testing set splits created by benchmark scripts.	1,106,424 RGB-D images	images (.png and .pkl) and (.pkl) label files	Classification, lifelong object recognition, robotic vision	2019	^[51]	Q. She et al.
THz and thermal video data set	This multispectral data set includes terahertz, thermal, visual, near infrared, and three-dimensional videos of objects hidden under people's clothes.	images and 3D point clouds	More than 20 videos. The duration of each video is about 85 seconds (about 345 frames).	AP2J	Experiments with hidden object detection	2019	^[52]^[53]	Alexei A. Morozov and Olga S. Sushkova
TomatoMAP	A large-scale, annotated RGB image dataset of tomato plants designed for fine-grained phenotyping.	Labeling, bounding box included	720,938	images	Image classification, object detection, semantic segmentation, instance segmentation	2026	^[54]	Y. Zhang et al.
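The CIFAR-10 row above lists only "Images" as the format. As a minimal sketch, assuming the "python version" of CIFAR-10 distributed by its authors (pickled batch files such as `data_batch_1`, each holding a `b"data"` array of 10,000 × 3,072 unsigned bytes and a `b"labels"` list), the following converts one 3,072-value row into a 32 × 32 grid of RGB tuples. Each row is assumed to store the red plane, then green, then blue, each as 1,024 values in row-major order. The function names `load_cifar_batch` and `cifar_row_to_hwc` are hypothetical.

```python
from __future__ import annotations

import pickle
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def load_cifar_batch(path: str) -> Tuple[object, List[int]]:
    """Load one CIFAR-10 'python version' batch file (assumed layout).

    Unpickling the b"data" array requires NumPy to be installed.
    """
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")  # dictionary keys are bytes in Python 3
    return batch[b"data"], batch[b"labels"]


def cifar_row_to_hwc(row, size: int = 32) -> List[List[Pixel]]:
    """Convert one flat row (R plane, then G, then B) into rows of RGB tuples."""
    plane = size * size
    if len(row) != 3 * plane:
        raise ValueError(f"expected {3 * plane} values, got {len(row)}")
    red, green, blue = row[:plane], row[plane:2 * plane], row[2 * plane:]
    # Pixel (y, x) sits at offset y * size + x within each colour plane.
    return [
        [(int(red[y * size + x]), int(green[y * size + x]), int(blue[y * size + x]))
         for x in range(size)]
        for y in range(size)
    ]
```

For example, `data, labels = load_cifar_batch("data_batch_1")` followed by `cifar_row_to_hwc(data[0])` would yield the first training image together with `labels[0]`, under the layout assumed above.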

Dataset Name	Brief description	Instances	Format	Default Task	Created (updated)	Reference	Creator
Princeton Shape Benchmark	3D polygonal models collected from the Internet	1814 models in 92 categories	3D polygonal models, categories	shape-based retrieval and analysis	2004	^[57]^[58]	Shilane et al.
Berkeley 3-D Object Dataset (B3DO)	Depth and color images collected from crowdsourced Microsoft Kinect users. Annotated in 50 object categories.	849 images, in 75 scenes	color image, depth image, object class, bounding boxes, 3D center points	Predict bounding boxes	2011, updated 2014	^[59]	Janoch et al.
ShapeNet	3D models. Some are classified into WordNet synsets, like ImageNet. Partially classified into 3,135 categories.	3,000,000 models, 220,000 of which are classified.	3D models, class labels	Predict class label.	2015	^[60]	Chang et al.
ObjectNet3D	Images, 3D shapes, and objects in 100 categories.	90,127 images, 201,888 objects, 44,147 3D shapes	images, 3D shapes, object bounding boxes, category labels	Recognizing the 3D pose and 3D shape of objects from 2D images	2016	^[61]^[62]	Xiang et al.
Common Objects in 3D (CO3D)	Frames from videos of objects in 50 MS-COCO categories, filmed by people on Amazon Mechanical Turk.	6 million frames from 40,000 videos	multi-view images, camera poses, 3D point clouds, object category	Predict object category. Generate objects.	2021, updated 2022 as CO3Dv2	^[63]^[64]	Meta AI
Google Scanned Objects	Scanned objects in SDF format.	over 1,000			2022	^[56]	Google AI
Objaverse-XL	3D objects	over 10 million	3D objects, metadata	novel view synthesis, 3D object generation	2023	^[65]	Deitke et al.
OmniObject3D	Scanned objects, labelled in 190 daily categories	6,000	textured meshes, point clouds, multiview images, videos	robust 3D perception, novel-view synthesis, surface reconstruction, 3D object generation	2023	^[66]^[67]	Wu et al.
UnCommon Objects in 3D (uCO3D)	Objects in 1,070 categories from the LVIS vocabulary				2025	^[68]^[69]	Meta AI

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Cityscapes Dataset	Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included.	Pixel-level segmentation and labeling	25,000	Images, text	Classification, object detection	2016	^[70]	Daimler AG et al.
German Traffic Sign Detection Benchmark Dataset	Images of traffic signs on German roads, taken from vehicles. These signs comply with UN standards and are therefore the same as in other countries that follow those standards.	Signs manually labeled	900	Images	Classification	2013	^[71]^[72]	S. Houben et al.
KITTI Vision Benchmark Dataset	Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners.	Many benchmarks extracted from data.	>100 GB of data	Images, text	Classification, object detection	2012	^[73]^[74]^[75]	A. Geiger et al.
FieldSAFE	Multi-modal dataset for obstacle detection in agriculture including stereo camera, thermal camera, web camera, 360-degree camera, lidar, radar, and precise localization.	Classes labelled geographically.	>400 GB of data	Images and 3D point clouds	Classification, object detection, object localization	2017	^[76]	M. Kragh et al.
Daimler Monocular Pedestrian Detection dataset	A dataset of pedestrians in urban environments.	Pedestrians are labeled with bounding boxes.	Labeled part contains 15,560 samples with pedestrians and 6,744 samples without. Test set contains 21,790 unlabeled images.	Images	Object recognition and classification	2006	^[77]^[78]^[79]	Daimler AG
CamVid	The Cambridge-driving Labeled Video Database (CamVid) is a collection of videos.	The dataset is labeled with semantic labels for 32 semantic classes.	over 700 images	Images	Object recognition and classification	2008	^[80]^[81]^[82]	Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, Roberto Cipolla
RailSem19	RailSem19 is a dataset for scene understanding in railway vision systems.	The dataset is labeled semantically and with bounding boxes.	8,500	Images	Object recognition and classification, scene recognition	2019	^[83]^[84]	Oliver Zendel, Markus Murschitz, Marcel Zeilinger, Daniel Steininger, Sara Abbasi, Csaba Beleznai
BOREAS	BOREAS is a multi-season autonomous driving dataset. It includes data from a Velodyne Alpha-Prime (128-beam) lidar, a FLIR Blackfly S camera, a Navtech CIR304-H radar, and an Applanix POS LV GNSS-INS.	The data is annotated with 3D bounding boxes.	350 km of driving data	Images, Lidar and Radar data	Object recognition and classification, scene recognition	2023	^[85]^[86]	Keenan Burnett, David J. Yoon, Yuchen Wu, Andrew Zou Li, Haowei Zhang, Shichen Lu, Jingxing Qian, Wei-Kang Tseng, Andrew Lambert, Keith Y.K. Leung, Angela P. Schoellig, Timothy D. Barfoot
Bosch Small Traffic Lights Dataset	A dataset of traffic lights.	The labels include bounding boxes of traffic lights together with their state (active light).	5,000 images for training and a video sequence of 8,334 frames for evaluation	Images	Traffic light recognition	2017	^[87]^[88]	Karsten Behrendt, Libor Novak, Rami Botros
FRSign	A dataset of French railway signals.	The labels include bounding boxes of railway signals together with their state (active light).	More than 100,000	Images	Railway signal recognition	2020	^[89]^[90]	Jeanine Harb, Nicolas Rébéna, Raphaël Chosidow, Grégoire Roblin, Roman Potarusov, Hatem Hajri
GERALD	A dataset of German railway signals.	The labels include bounding boxes of railway signals together with their state (active light).	5,000	Images	Railway signal recognition	2023	^[91]^[92]	Philipp Leibner, Fabian Hampel, Christian Schindler
Multi-cue pedestrian	The multi-cue onboard pedestrian detection dataset is a dataset for detecting pedestrians.	The dataset is labeled with bounding boxes.	1,092 image pairs with 1,776 pedestrian boxes	Images	Object recognition and classification	2009	^[93]	Christian Wojek, Stefan Walk, Bernt Schiele
RAWPED	RAWPED is a dataset for detecting pedestrians in railway environments.	The dataset is labeled with bounding boxes.	26,000	Images	Object recognition and classification	2020	^[94]^[95]	Tugce Toprak, Burak Belenlioglu, Burak Aydın, Cuneyt Guzelis, M. Alper Selver
OSDaR23	OSDaR23 is a multi-sensor dataset for object detection in railway environments.	The dataset is labeled with bounding boxes.	16,874 frames	Images, Lidar, Radar and Infrared	Object recognition and classification	2023	^[96]^[97]	Roman Tilly, Rustam Tagiew, Pavel Klasek (DZSF); Philipp Neumaier, Patrick Denzler, Tobias Klockau, Martin Boekhoff, Martin Köppel (Digitale Schiene Deutschland); Karsten Schwalbe (FusionSystems)
Argoverse	Argoverse is a multi-sensor dataset for object detection in road environments.	The dataset is annotated with bounding boxes.	320 hours of recording	Data from 7 cameras and LiDAR	Object recognition and classification, object tracking	2022	^[98]^[99]	Argo AI, Carnegie Mellon University, Georgia Institute of Technology
Rail3D	Rail3D is a LiDAR dataset for railways recorded in Hungary, France, and Belgium	The dataset is annotated semantically	288 million annotated points	LiDAR	Object recognition and classification, object tracking	2024	^[100]	Abderrazzaq Kharroubi, Ballouch Zouhair, Rafika Hajji, Anass Yarroudh, and Roland Billen; University of Liège and Hassan II Institute of Agronomy and Veterinary Medicine
WHU-Railway3D	WHU-Railway3D is a LiDAR dataset for urban, rural, and plateau railways recorded in China	The dataset is annotated semantically	4.6 billion annotated data points	LiDAR	Object recognition and classification, object tracking	2024	^[101]	Bo Qiu, Yuzhou Zhou, Lei Dai; Bing Wang, Jianping Li, Zhen Dong, Chenglu Wen, Zhiliang Ma, Bisheng Yang; Wuhan University, University of Oxford, Hong Kong Polytechnic University, Nanyang Technological University, Xiamen University and Tsinghua University
RailFOD23	A dataset of foreign objects on railway catenaries	The dataset is annotated with bounding boxes	14,615 images	Images	Object recognition and classification, object tracking	2024	^[102]	Zhichao Chen, Jie Yang, Zhicheng Feng, Hao Zhu; Jiangxi University of Science and Technology
ESRORAD	A dataset of images and point clouds of urban road and rail scenes from Le Havre and Rouen	The dataset is annotated with bounding boxes	2,700 k virtual images and 100,000 real images	Images, LiDAR	Object recognition and classification, object tracking	2022	^[103]	Redouane Khemmar, Antoine Mauri, Camille Dulompont, Jayadeep Gajula, Vincent Vauchey, Madjid Haddad and Rémi Boutteau; Le Havre Normandy University and SEGULA Technologies
RailVID	Data recorded with an InfiRay AT615X infrared thermography camera in diverse railway scenarios, including carports, depots, and straight track.	The dataset is annotated semantically	1,071 images	infrared images	Object recognition and classification, object tracking	2022	^[104]	Hao Yuan, Zhenkun Mei, Yihao Chen, Weilong Niu, Cheng Wu; Soochow University
RailPC	LiDAR dataset in the context of railways	The dataset is annotated semantically	3 billion data points	LiDAR	Object recognition and classification, object tracking	2024	^[105]	Tengping Jiang, Shiwei Li, Qinyu Zhang, Guangshuai Wang, Zequn Zhang, Fankun Zeng, Peng An, Xin Jin, Shan Liu, Yongjun Wang; Nanjing Normal University, Ministry of Natural Resources, Eastern Institute of Technology, Tianjin Key Laboratory of Rail Transit Navigation Positioning and Spatio‐temporal Big Data Technology, Northwest Normal University, Washington University in St. Louis and Ningbo University of Technology
RailCloud-HdF	LiDAR dataset in the context of railways	The dataset is annotated semantically	8060.3 million data points	LiDAR	Object recognition and classification, object tracking	2024	^[106]	Mahdi Abid, Mathis Teixeira, Ankur Mahtani and Thomas Laurent; Railenium
RailGoerl24	RGB and LiDAR dataset in the context of railways	The dataset is annotated with bounding boxes	12,205 HD RGB frames and 383,922,305 colored LiDAR points	RGB, LiDAR	Person recognition and classification	2025	^[107]	Rustam Tagiew, Ilkay Wunderlich, Philipp Zanitzer, Mark Sastuba, Carsten Knoll, Kilian Göller, Haadia Amjad, Steffen Seitz
MRSI	RGB and infrared dataset in the context of railways	The dataset is annotated with bounding boxes and pixel-wise masks for eleven classes, including background	23,000 RGB images and 4,000 infrared images	RGB, Infrared	Object recognition and classification	2022	^[108]	Yihao Chen, Ning Zhu, Qian Wu, Cheng Wu, Weilong Niu and Yiming Wang
RailDriVE February 2019	Dataset for rail vehicle positioning experiments	The dataset is not annotated	26 min 46 s of back-and-forth driving on a 1.2 km track segment	GNSS, IMU, speed/distance sensors (radar, optical, odometer), RGB	Localization and mapping	2019	^[109]	Hanno Winter, Michael Helmut Roth
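The KITTI row above does not describe how its annotations are stored. As a hedged sketch, assuming the plain-text label files of KITTI's 2D/3D object detection benchmark (one object per line with 15 whitespace-separated fields, plus a trailing confidence score in result files, as described in that benchmark's development kit), the following parses a single label line. The class `KittiObject` and the function `parse_kitti_label_line` are hypothetical names, not part of any official KITTI tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class KittiObject:
    """One object from an assumed KITTI object-benchmark label line (hypothetical container)."""
    label: str                     # object type, e.g. "Car", "Pedestrian", "DontCare"
    truncated: float               # 0 (not truncated) to 1 (truncated)
    occluded: int                  # occlusion state: 0, 1, 2 or 3
    alpha: float                   # observation angle in radians
    bbox: Tuple[float, ...]        # 2D box: left, top, right, bottom in pixels
    dimensions: Tuple[float, ...]  # 3D size: height, width, length in meters
    location: Tuple[float, ...]    # x, y, z in camera coordinates, meters
    rotation_y: float              # rotation around the camera Y axis, radians
    score: Optional[float] = None  # only present in result files


def parse_kitti_label_line(line: str) -> KittiObject:
    """Parse one whitespace-separated label line (15 fields, or 16 with a score)."""
    fields = line.split()
    if len(fields) not in (15, 16):
        raise ValueError(f"expected 15 or 16 fields, got {len(fields)}")
    values = [float(v) for v in fields[1:]]  # every field after the type is numeric
    return KittiObject(
        label=fields[0],
        truncated=values[0],
        occluded=int(values[1]),
        alpha=values[2],
        bbox=tuple(values[3:7]),
        dimensions=tuple(values[7:10]),
        location=tuple(values[10:13]),
        rotation_y=values[13],
        score=values[14] if len(values) == 15 else None,
    )
```

Parsing every numeric field as `float` first and then casting the occlusion flag keeps the sketch short. A production loader would probably also validate value ranges and skip `DontCare` regions.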

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created (updated)	Reference	Creator
Labeled Faces in the Wild (LFW)	Images of named individuals obtained by Internet search.	frontal face detection, bounding box cropping	13233 images of 5749 named individuals	images, labels	unconstrained face recognition	2008	^[111]^[112]	Huang et al.
Aff-Wild	298 videos of 200 individuals, ~1,250,000 manually annotated images: annotated in terms of dimensional affect (valence-arousal); in-the-wild setting; color database; various resolutions (average = 640×360)	Detected faces, facial landmarks, and valence-arousal annotations	~1,250,000 manually annotated images	video (visual + audio modalities)	affect recognition (valence-arousal estimation)	2017	CVPR^[113] IJCV^[114]	D. Kollias et al.
Aff-Wild2	558 videos of 458 individuals, ~2,800,000 manually annotated images: annotated in terms of i) categorical affect (7 basic expressions: neutral, happiness, sadness, surprise, fear, disgust, anger); ii) dimensional affect (valence-arousal); iii) action units (AUs 1,2,4,6,12,15,20,25); in-the-wild setting; color database; various resolutions (average = 1030×630)	Detected faces, detected and aligned faces, and annotations	~2,800,000 manually annotated images	video (visual + audio modalities)	affect recognition (valence-arousal estimation, basic expression classification, action unit detection)	2019	BMVC^[115] FG^[116]	D. Kollias et al.
FERET (facial recognition technology)	11338 images of 1199 individuals in different positions and at different times.	None.	11,338	Images	Classification, face recognition	2003	^[117]^[118]	United States Department of Defense
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)	7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities.	Files labelled with expression. Perceptual validation ratings provided by 319 raters.	7,356	Video, sound files	Classification, face recognition, voice recognition	2018	^[119]^[120]	S.R. Livingstone and F.A. Russo
SCFace	Color images of faces at various angles.	Location of facial features extracted. Coordinates of features given.	4,160	Images, text	Classification, face recognition	2011	^[121]^[122]	M. Grgic et al.
Yale Face Database	Faces of 15 individuals in 11 different expressions.	Labels of expressions.	165	Images	Face recognition	1997	^[123]^[124]	J. Yang et al.
Cohn-Kanade AU-Coded Expression Database	Large database of images with labels for expressions.	Tracking of certain facial features.	500+ sequences	Images, text	Facial expression analysis	2000	^[125]^[126]	T. Kanade et al.
JAFFE Facial Expression Database	213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models.	Images are cropped to the facial region. Includes semantic ratings data on emotion labels.	213	Images, text	Facial expression cognition	1998	^[127]^[128]	Lyons, Kamachi, Gyoba
FaceScrub	Images of public figures collected from image searches.	Name and gender (male/female) annotations.	107,818	Images, text	Face recognition	2014	^[129]^[130]	H. Ng et al.
BioID Face Database	Images of faces with eye positions marked.	Manually set eye positions.	1521	Images, text	Face recognition	2001	^[131]	BioID
Skin Segmentation Dataset	Randomly sampled color values from face images.	B, G, R values extracted.	245,057	Text	Segmentation, classification	2012	^[132]^[133]	R. Bhatt
Bosphorus	3D face image database.	34 action units and 6 expressions labeled; 24 facial landmarks labeled.	4652	Images, text	Face recognition, classification	2008	^[134]^[135]	A. Savran et al.
UOY 3D-Face	Neutral face and 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised.	Labeling.	5250	Images, text	Face recognition, classification	2004	^[136]^[137]	University of York
CASIA 3D Face Database	Expressions: Anger, smile, laugh, surprise, closed eyes.	None.	4624	Images, text	Face recognition, classification	2007	^[138]^[139]	Institute of Automation, Chinese Academy of Sciences
CASIA NIR	Expressions: anger, disgust, fear, happiness, sadness, surprise.	None.	480	Annotated visible-spectrum and near-infrared video captures at 25 frames per second	Face recognition, classification	2011	^[140]	Zhao, G. et al.
BU-3DFE	Neutral face and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 intensity levels). 3D images extracted.	None.	2500	Images, text	Facial expression recognition, classification	2006	^[141]	Binghamton University
Face Recognition Grand Challenge Dataset	Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data.	None.	4007	Images, text	Face recognition, classification	2004	^[142]^[143]	National Institute of Standards and Technology
Gavabdb	Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images.	None.	549	Images, text	Face recognition, classification	2008	^[144]^[145]	King Juan Carlos University
3D-RMA	Up to 100 subjects, expressions mostly neutral. Several poses as well.	None.	9971	Images, text	Face recognition, classification	2004	^[146]^[147]	Royal Military Academy (Belgium)
SoF	112 people (66 male, 46 female) wearing glasses under different illumination conditions.	A set of synthetic filters (blur, occlusions, noise, and posterization) with different levels of difficulty.	42,592 (2,662 original images × 16 synthetic variants)	Images, Mat file	Gender classification, face detection, face recognition, age estimation, and glasses detection	2017	^[148]^[149]	Afifi, M. et al.
IMDb-WIKI	IMDb and Wikipedia face images with gender and age labels.	None	523,051	Images	Gender classification, face detection, face recognition, age estimation	2015	^[150]	R. Rothe, R. Timofte, L. V. Gool

Dataset name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
AVA-Kinetics Localized Human Actions Video	Annotations of 80 action classes on keyframes from Kinetics-700 videos.		1.6 million annotations. 238,906 video clips, 624,430 keyframes.	Annotations, videos.	Action prediction	2020	^[151]^[152]	Li et al., Perception Team, Google AI
TV Human Interaction Dataset	Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss, and none.	None.	6,766 video clips	video clips	Action prediction	2013	^[153]	Patron-Perez, A. et al.
Berkeley Multimodal Human Action Database (MHAD)	Recordings of a single person performing 12 actions	MoCap pre-processing	660 action samples	8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones	Action classification	2013	^[154]	Ofli, F. et al.
THUMOS Dataset	Large video dataset for action classification.	Actions classified and labeled.	45M frames of video	Video, images, text	Classification, action detection	2013	^[155]^[156]	Y. Jiang et al.
MEXAction2	Video dataset for action localization and spotting	Actions classified and labeled.	1000	Video	Action detection	2014	^[157]	Stoian et al.

Object detection and recognition

3D Objects

Object detection and recognition for autonomous vehicles

Facial recognition

Action recognition

Handwriting and character recognition

Aerial images

Underwater images

Other images

References

Related Articles

Dataset name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Artificial Characters Dataset	Artificially generated data describing the structure of 10 capital English letters.	Coordinates of lines drawn given as integers. Various other features.	6000	Text	Handwriting recognition, classification	1992	^[158]	H. Guvenir et al.
Letter Dataset	Upper-case printed letters.	17 features are extracted from all images.	20,000	Text	OCR, classification	1991	^[159]^[160]	D. Slate et al.
CASIA-HWDB	Offline handwritten Chinese character database. 3755 classes in the GB 2312 character set.	Gray-scaled images with background pixels labeled as 255.	1,172,907	Images, Text	Handwriting recognition, classification	2009	^[161]	CASIA
CASIA-OLHWDB	Online handwritten Chinese character database, collected using Anoto pen on paper. 3755 classes in the GB 2312 character set.	Provides the sequences of coordinates of strokes.	1,174,364	Images, Text	Handwriting recognition, classification	2009	^[162]^[161]	CASIA
Character Trajectories Dataset	Labeled samples of pen tip trajectories for people writing simple characters.	3-dimensional pen tip velocity trajectory matrix for each sample	2858	Text	Handwriting recognition, classification	2008	^[163]^[164]	B. Williams
Chars74K Dataset	Character recognition in natural images of symbols used in both English and Kannada		74,107		Character recognition, handwriting recognition, OCR, classification	2009	^[165]	T. de Campos
EMNIST dataset	Handwritten characters from 3600 contributors	Derived from NIST Special Database 19. Converted to 28x28 pixel images, matching the MNIST dataset.^[166]	800,000	Images	character recognition, classification, handwriting recognition	2016	EMNIST dataset^[167] Documentation^[168]	Gregory Cohen, et al.
UJI Pen Characters Dataset	Isolated handwritten characters	Coordinates of the pen position recorded as the characters were written.	11,640	Text	Handwriting recognition, classification	2009	^[169]^[170]	F. Prat et al.
Gisette Dataset	Handwriting samples from the often-confused 4 and 9 characters.	Features extracted from images, split into train/test, handwriting images size-normalized.	13,500	Images, text	Handwriting recognition, classification	2003	^[171]	Yann LeCun et al.
Omniglot dataset	1,623 different handwritten characters from 50 different alphabets.	Hand-labeled.	38,300	Images, text, strokes	Classification, one-shot learning	2015	^[172]^[173]	B. Lake et al.
MNIST database	Database of handwritten digits.	Hand-labeled.	60,000	Images, text	Classification	1994	^[174]^[175]	National Institute of Standards and Technology
Optical Recognition of Handwritten Digits Dataset	Normalized bitmaps of handwritten data.	Size normalized and mapped to bitmaps.	5620	Images, text	Handwriting recognition, classification	1998	^[176]	E. Alpaydin et al.
Pen-Based Recognition of Handwritten Digits Dataset	Handwritten digits on electronic pen-tablet.	Feature vectors extracted to be uniformly spaced.	10,992	Images, text	Handwriting recognition, classification	1998	^[177]^[178]	E. Alpaydin et al.
Semeion Handwritten Digit Dataset	Handwritten digits from 80 people.	All handwritten digits have been normalized for size and mapped to the same grid.	1593	Images, text	Handwriting recognition, classification	2008	^[179]	T. Srl
HASYv2	Handwritten mathematical symbols	All symbols are centered and of size 32 × 32 pixels.	168,233	Images, text	Classification	2017	^[180]	Martin Thoma
Noisy Handwritten Bangla Dataset	Includes a Handwritten Numeral Dataset (10 classes) and a Basic Character Dataset (50 classes); each has three types of noise: white Gaussian, motion blur, and reduced contrast.	All images are centered and of size 32 × 32.	Numeral Dataset: 23,330; Character Dataset: 76,000	Images, text	Handwriting recognition, classification	2017	^[181]^[182]	M. Karki et al.
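The MNIST rows list "Images, text" or "image, label" as the format, and the EMNIST row notes that its images match the MNIST dataset. As a minimal sketch, assuming the IDX file format used by the commonly distributed gzip-compressed MNIST files, the following reads unsigned-byte IDX data with only the standard library. In that format, the magic number is two zero bytes followed by a type code and a dimension count, then come big-endian 32-bit dimension sizes and the raw data. Image files use magic number 2051 and label files use 2049. The function names `parse_idx_ubyte`, `read_idx_gz`, and `image_at` are hypothetical.

```python
from __future__ import annotations

import gzip
import struct
from typing import List, Tuple


def parse_idx_ubyte(buf: bytes) -> Tuple[Tuple[int, ...], bytes]:
    """Parse an IDX buffer whose element type is unsigned byte (type code 0x08)."""
    if len(buf) < 4:
        raise ValueError("buffer too short for an IDX header")
    zero_a, zero_b, type_code, ndim = struct.unpack(">4B", buf[:4])
    if zero_a != 0 or zero_b != 0:
        raise ValueError("magic number must start with two zero bytes")
    if type_code != 0x08:
        raise ValueError("this sketch handles only unsigned-byte data")
    header_end = 4 + 4 * ndim
    if len(buf) < header_end:
        raise ValueError("buffer too short for the dimension sizes")
    dims = struct.unpack(f">{ndim}I", buf[4:header_end])  # big-endian uint32 sizes
    expected = 1
    for d in dims:
        expected *= d
    data = buf[header_end:]
    if len(data) != expected:
        raise ValueError(f"expected {expected} data bytes, got {len(data)}")
    return dims, data


def read_idx_gz(path: str) -> Tuple[Tuple[int, ...], bytes]:
    """Read a gzip-compressed IDX file (assumed to be how the files are distributed)."""
    with gzip.open(path, "rb") as f:
        return parse_idx_ubyte(f.read())


def image_at(dims: Tuple[int, ...], data: bytes, index: int) -> List[List[int]]:
    """Return image `index` from a 3-D (count, rows, cols) array as nested pixel rows."""
    _, rows, cols = dims
    start = index * rows * cols
    return [list(data[start + r * cols:start + (r + 1) * cols]) for r in range(rows)]
```

With MNIST-style files, an image file would be expected to yield dimensions such as `(count, 28, 28)` and the matching label file `(count,)`, so `image_at(dims, data, i)` and `labels[i]` would describe the same sample.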

Dataset name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
iSAID: Instance Segmentation in Aerial Images Dataset		Precise instance-level annotation carried out by professional annotators, cross-checked and validated by expert annotators complying with well-defined guidelines.	655,451 (15 classes)	Images, jpg, json	Aerial Classification, Object Detection, Instance Segmentation	2019	^[183]^[184]	Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, Xiang Bai
Aerial Image Segmentation Dataset	80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0 m.	Images manually segmented.	80	Images	Aerial classification, object detection	2013	^[185]^[186]	J. Yuan et al.
KIT AIS Data Set	Multiple labeled training and evaluation datasets of aerial images of crowds.	Images manually labeled to show paths of individuals through crowds.	~150	Images with paths	People tracking, aerial tracking	2012	^[187]^[188]	M. Butenuth et al.
Wilt Dataset	Remote sensing data of diseased trees and other land cover.	Various features extracted.	4899	Images	Classification, aerial object detection	2014	^[189]^[190]	B. Johnson
MASATI dataset	Maritime scenes of optical aerial images from the visible spectrum. It contains color images of dynamic marine environments; each image may contain one or more targets under different weather and illumination conditions.	Object bounding boxes and labeling.	7,389	Images	Classification, aerial object detection	2018	^[191]^[192]	A.-J. Gallego et al.
Forest Type Mapping Dataset	Satellite imagery of forests in Japan.	Image wavelength bands extracted.	326	Text	Classification	2015	^[193]^[194]	B. Johnson
Overhead Imagery Research Data Set	Annotated overhead imagery containing multiple objects per image.	Over 30 annotations and over 60 statistics that describe the target within the context of the image.	1,000	Images, text	Classification	2009	^[195]^[196]	F. Tanner et al.
SpaceNet	A corpus of commercial satellite imagery and labeled training data.	GeoTIFF and GeoJSON files containing building footprints.	>17,533	Images	Classification, object identification	2017	^[197]^[198]^[199]	DigitalGlobe, Inc.
UC Merced Land Use Dataset	Images manually extracted from large images in the USGS National Map Urban Area Imagery collection for various urban areas around the US.	21 land use classes intended for research purposes, with 100 images per class.	2,100	Image chips of 256×256 pixels, 30 cm (1 foot) GSD	Land cover classification	2010	^[200]	Yi Yang and Shawn Newsam
SAT-4 Airborne Dataset	Images extracted from the National Agriculture Imagery Program (NAIP) dataset.	Four broad land cover classes: barren land, trees, grassland, and a class comprising all land cover other than those three.	500,000	Images	Classification	2015	^[201]^[202]	S. Basu et al.
SAT-6 Airborne Dataset	Images extracted from the National Agriculture Imagery Program (NAIP) dataset.	Six broad land cover classes: barren land, trees, grassland, roads, buildings, and water bodies.	405,000	Images	Classification	2015	^[201]^[202]	S. Basu et al.

Dataset name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
SUIM Dataset	Images collected during oceanic explorations and human-robot collaborative experiments, and annotated by human participants.	Images with pixel annotations for eight object categories: waterbody background, fish (vertebrates), reefs (invertebrates), aquatic plants, wrecks/ruins, human divers, robots, and sea floor.	1,635	Images	Segmentation	2020	^[203]	Md Jahidul Islam et al.
LIACI Dataset	Images collected during underwater ship inspections and annotated by human domain experts.	Images with pixel annotations for ten object categories: defects, corrosion, paint peel, marine growth, sea chest gratings, overboard valves, propeller, anodes, bilge keel, and ship hull.	1,893	Images	Segmentation	2022	^[204]	Waszak et al.

Dataset name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Kodak Lossless True Color Image Suite	RGB images for testing image compression.	None	24	Image	Image compression	1999	^[205]	Kodak
NRC-GAMMA	A benchmark dataset of gas meter images.	None	28,883	Image, Label	Classification	2021	^[206]^[207]	A. Ebadi, P. Paul, S. Auer, & S. Tremblay
The SUPATLANTIQUE dataset	Images of scanned official and Wikipedia documents.	None	4,908	TIFF/pdf	Source device identification, forgery detection, classification	2020	^[208]	C. Ben Rabah et al.
Density functional theory quantum simulations of graphene	Labelled images of raw input to a simulation of graphene.	Raw data (in HDF5 format) and output labels from density functional theory quantum simulation	60,744 test and 501,473 training files	Labeled images	Regression	2019	^[209]	K. Mills & I. Tamblyn
Quantum simulations of an electron in a two-dimensional potential well	Labelled images of raw input to a simulation of 2D quantum mechanics.	Raw data (in HDF5 format) and output labels from quantum simulation	1.3 million images	Labeled images	Regression	2017	^[210]	K. Mills, M.A. Spanner, & I. Tamblyn
MPII Cooking Activities Dataset	Videos and images of various cooking activities.	Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling.	881,755 frames	Labeled video, images, text	Classification	2012	^[211]^[212]	M. Rohrbach et al.
FAMOS Dataset	5,000 unique microstructures; each sample was acquired three times with two different cameras.	Original PNG files, sorted per camera and then per acquisition. MATLAB data files with one 16,384 × 5,000 matrix per camera per acquisition.	30,000	Images and .mat files	Authentication	2012	^[213]	S. Voloshynovskiy, et al.
PharmaPack Dataset	1,000 unique classes with 54 images per class.	Class labeling, local descriptors such as SIFT and AKAZE, and local feature aggregators such as Fisher Vector (FV).	54,000	Images and .mat files	Fine-grained classification	2017	^[214]	O. Taran and S. Rezaeifar, et al.
Stanford Dogs Dataset	Images of 120 breeds of dogs from around the world.	Train/test splits and ImageNet annotations provided.	20,580	Images, text	Fine-grained classification	2011	^[215]^[216]	A. Khosla et al.
StanfordExtra Dataset	2D keypoints and segmentations for the Stanford Dogs Dataset.	2D keypoints and segmentations provided.	12,035	Labelled images	3D reconstruction/pose estimation	2020	^[217]	B. Biggs et al.
The Oxford-IIIT Pet Dataset	37 pet categories with roughly 200 images each.	Breed labels, tight bounding boxes, and foreground-background segmentation.	~7,400	Images, text	Classification, object detection	2012	^[216]^[218]	O. Parkhi et al.
Corel Image Features Data Set	Database of images with extracted features.	Many features, including color histogram, co-occurrence texture, and color moments.	68,040	Text	Classification, object detection	1999	^[219]^[220]	M. Ortega-Bindenberger et al.
Online Video Characteristics and Transcoding Time Dataset	Transcoding times for a variety of videos, along with their video properties.	Video features given.	168,286	Text	Regression	2015	^[221]	T. Deneke et al.
Microsoft Sequential Image Narrative Dataset (SIND)	Dataset for sequential vision-to-language tasks.	Descriptive captions and story text given for each photo; photos are arranged in sequences.	81,743	Images, text	Visual storytelling	2016	^[222]	Microsoft Research
Caltech-UCSD Birds-200-2011 Dataset	Large dataset of images of birds.	Part locations for birds, bounding boxes, 312 binary attributes given	11,788	Images, text	Classification	2011	^[223]^[224]	C. Wah et al.
YouTube-8M	Large and diverse labeled video dataset	YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities	8 million	Video, text	Video classification	2016	^[225]^[226]	S. Abu-El-Haija et al.
YFCC100M	Large and diverse labeled image and video dataset	Flickr videos and images with associated descriptions, titles, tags, and other metadata (such as Exif and geotags)	100 million	Video, Image, Text	Video and Image classification	2016	^[227]^[228]	B. Thomee et al.
Discrete LIRIS-ACCEDE	Short videos annotated for valence and arousal.	Valence and arousal labels.	9,800	Video	Video emotion elicitation detection	2015	^[229]	Y. Baveye et al.
Continuous LIRIS-ACCEDE	Long videos annotated for valence and arousal, with galvanic skin response also recorded.	Valence and arousal labels.	30	Video	Video emotion elicitation detection	2015	^[230]	Y. Baveye et al.
MediaEval LIRIS-ACCEDE	Extension of Discrete LIRIS-ACCEDE with annotations for the violence levels of the films.	Violence, valence and arousal labels.	10,900	Video	Video emotion elicitation detection	2015	^[231]	Y. Baveye et al.
Leeds Sports Pose	Articulated human pose annotations in 2,000 natural sports images from Flickr.	Rough crop around a single person of interest, with 14 joint labels	2,000	Images plus .mat file labels	Human pose estimation	2010	^[232]	S. Johnson and M. Everingham
Leeds Sports Pose Extended Training	Articulated human pose annotations in 10,000 natural sports images from Flickr.	14 joint labels via crowdsourcing	10,000	Images plus .mat file labels	Human pose estimation	2011	^[233]	S. Johnson and M. Everingham
MCQ Dataset	Six real multiple-choice exams (735 answer sheets and 33,540 answer boxes) for evaluating computer vision techniques and systems developed for multiple-choice test assessment.	None	735 answer sheets and 33,540 answer boxes	Images and .mat file labels	Development of multiple-choice test assessment systems	2017	^[234]^[235]	Afifi, M. et al.
Surveillance Videos	Real surveillance videos covering a long period (7 days, 24 hours each).	None	19 surveillance videos (7 days, 24 hours each)	Videos	Data compression	2016	^[236]	Taj-Eddin, I. A. T. F. et al.
LILA BC	Labeled Information Library of Alexandria: Biology and Conservation. Labeled images that support machine learning research around ecology and environmental science.	None	~10M images	Images	Classification	2019	^[237]	LILA working group
Can We See Photosynthesis?	32 videos for eight live and eight dead leaves recorded under both DC and AC lighting conditions.	None	32 videos	Videos	Liveness detection of plants	2017	^[238]	Taj-Eddin, I. A. T. F. et al.
Mathematical Mathematics Memes	Collection of 10,000 memes on mathematics.	None	~10,000	Images	Visual storytelling, object detection	2021	^[239]	Mathematical Mathematics Memes
Flickr-Faces-HQ Dataset	Collection of images containing a face each, crawled from Flickr	Pruned with "various automatic filters", cropped and aligned to faces, and had images of statues, paintings, or photos of photos removed via crowdsourcing	70,000	Images	Face Generation	2019	^[240]	Karras et al.
Fruits-360 dataset	Images of 170 kinds of fruits, vegetables, nuts, and seeds.	100×100 pixels, white background.	115,499	Images (jpg)	Classification	2017–2025	^[241]	Mihai Oltean
RVL-CDIP	Scanned documents from the Truth Tobacco Industry Documents library.	Maximum dimension is 1000 pixels.	400,000	Images	Classification	2015	^[242]	A. Harley et al.

Related Articles

"WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning"

"ImageNet: A large-scale hierarchical image database"

"Curler: finding and visualizing nonlinear correlation clusters"

"Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping"

"Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark"

"Papers with Code - Daimler Monocular Pedestrian Detection Dataset"

"Semantic object classes in video: A high-definition ground truth database"

"GERALD: A novel dataset for the detection of German mainline railway signals"

"Conditional Weighted Ensemble of Transferred Models for Camera Based Onboard Pedestrian Detection in Railway Driver Support Systems"

"RailFOD23: A dataset for foreign object detection on railroad transmission lines"

"Road and Railway Smart Mobility: A High-Definition Ground Truth Hybrid Dataset"

"Görlitz Rail Test Center CV Dataset 2024 (RailGoerl24)"

"The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English"

"Coding facial expressions with Gabor wavelets"

"Fuzzy logic color detection: Blue areas in melanoma dermoscopy images"

"3D shape-based face recognition using automatically registered facial surfaces"

"Learning a mixture of sparse distance metrics for classification and dimensionality reduction"

"Low level crowd analysis using frame-wise normalized feature for people counting"

"A new classification model for a class imbalanced data set using genetic programming and support vector machines: Case study for wilt disease classification"

https://www.iuii.ua.es/datasets/masati/

"Forest Type Classification: A Hybrid NN-GA Model Based Approach"

"The MediaEval 2015 Affective Impact of Movies Task"