How Small is Big Enough? Open Labeled Datasets and the Development of Deep Learning

Daniel Souza; Aldo Geuna; Jeff Rodríguez

TL;DR

This paper investigates how open labeled datasets, particularly CIFAR-10, catalyzed the emergence and trajectory of Deep Learning as a technoscience. Using a mixed-methods approach (qualitative interviews, a CIFAR-10–focused survey, and econometric analysis of hundreds of papers and patent citations from 2010–2022), the authors trace how dataset size, labeling quality, and accessibility enabled rapid architectural experimentation, diffusion through teaching, and early breakthroughs (notably AlexNet). The econometric results show that CIFAR-10 papers attracted substantial patent citations in the early period and continue to do so, whereas their scientific citations waned after 2014; ImageNet, by contrast, maintained a stronger and more persistent influence on scientific progress. The study highlights a division of labor between CIFAR-10’s technological impact and ImageNet’s ongoing scientific influence. It argues that small, open datasets played a critical and lasting role in the DL revolution and in the diffusion of open science, with implications for future data-sharing policy and funding strategies.

Abstract

We investigate the emergence of Deep Learning as a technoscientific field, emphasizing the role of open labeled datasets. Through qualitative and quantitative analyses, we evaluate the role of datasets like CIFAR-10 in advancing computer vision and object recognition, which are central to the Deep Learning revolution. Our findings highlight CIFAR-10's crucial role and enduring influence on the field, as well as its importance in teaching ML techniques. Results also indicate that dataset characteristics such as size, number of instances, and number of categories were key factors. Econometric analysis confirms that CIFAR-10, a small-but-sufficiently-large open dataset, played a significant and lasting role in technological advancements and had a major function in the development of the early scientific literature, as shown by citation metrics.

Paper Structure (21 sections, 2 equations, 9 figures, 16 tables)

Introduction
Conceptual framework
Open Science
The GPU Revolution
AI as a Technoscience
Institutional background
Winning the brain wars: The emergence of DL as a dominant paradigm within AI
The development of Open Labeled Datasets and CIFAR-10
Methods and data
Method
Data
Findings
Interviews analysis
Survey analysis
Econometric analysis
...and 6 more sections

Figures (9)

Figure 1: Distribution of Publications by Subject Area
Figure 2: The Rise of Annotated Image Datasets
Figure 3: Survey Results - CIFAR-10 Dataset's Impact on DL & Computer Vision
Figure 4: Survey Results - Comparing CIFAR-10 with Similar Datasets
Figure 5: Survey Results - Integration of CIFAR-10 in Teaching Environment
...and 4 more figures
