How Small is Big Enough? Open Labeled Datasets and the Development of Deep Learning

Daniel Souza, Aldo Geuna, Jeff Rodríguez

TL;DR

This paper investigates how open labeled datasets, particularly CIFAR-10, catalyzed the emergence and trajectory of Deep Learning as a technoscience. Using a mixed-methods approach (qualitative interviews, a CIFAR-10–focused survey, and econometric analysis of hundreds of papers and patent citations from 2010 to 2022), the authors trace how dataset size, labeling quality, and accessibility enabled rapid architectural experimentation, diffusion through teaching, and early breakthroughs (notably AlexNet). Econometric results show that CIFAR-10 papers garnered substantial patent citations both in the early period and afterward, although their scientific citations waned after 2014, whereas ImageNet maintained a stronger and more persistent influence on scientific progress. The study highlights a division of labor between CIFAR-10’s technological impact and ImageNet’s ongoing scientific influence, and argues that small, open datasets played a critical and lasting role in the DL revolution and in open science diffusion, with implications for future data-sharing policy and funding strategies.

Abstract

We investigate the emergence of Deep Learning as a technoscientific field, emphasizing the role of open labeled datasets. Through qualitative and quantitative analyses, we evaluate the role of datasets like CIFAR-10 in advancing computer vision and object recognition, which are central to the Deep Learning revolution. Our findings highlight CIFAR-10's crucial role and enduring influence on the field, as well as its importance in teaching ML techniques. Results also indicate that dataset characteristics such as size, number of instances, and number of categories were key factors. Econometric analysis confirms that CIFAR-10, a small-but-sufficiently-large open dataset, played a significant and lasting role in technological advancements and had a major function in the development of the early scientific literature, as shown by citation metrics.

Paper Structure

This paper contains 21 sections, 2 equations, 9 figures, and 16 tables.

Figures (9)

  • Figure 1: Distribution of Publications by Subject Area
  • Figure 2: The Rise of Annotated Image Datasets
  • Figure 3: Survey Results - CIFAR-10 Dataset's Impact on DL & Computer Vision
  • Figure 4: Survey Results - Comparing CIFAR-10 with Similar Datasets
  • Figure 5: Survey Results - Integration of CIFAR-10 in Teaching Environment
  • ...and 4 more figures