From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution

Bernard J. Koch, David Peterson

TL;DR

The paper investigates why AI research shifted from autonomous, exploratory science to an externally steered, benchmark-driven trajectory, coalescing into an epistemic monoculture centered on deep learning. Through interviews, archival sources, and large-scale corpus analysis, it traces three eras: the diverse but inconclusive Symbolic AI period, the benchmarking era that centralized progress around predictive accuracy, and the Deep Learning monoculture driven by data/compute and industry leadership. It argues that ceding autonomy and adopting single-metric benchmarking accelerated progress but narrowed epistemic perspectives, with significant social and environmental implications. The emergence of generative AI tests the resilience of this monoculture and raises critical questions about autonomy, reproducibility, and the diffusion of AI governance across disciplines.

Abstract

Over the past decade, AI research has focused heavily on building ever-larger deep learning models. This approach has simultaneously unlocked incredible achievements in science and technology, and hindered AI from overcoming long-standing limitations with respect to explainability, ethical harms, and environmental efficiency. Drawing on qualitative interviews and computational analyses, our three-part history of AI research traces the creation of this "epistemic monoculture" back to a radical reconceptualization of scientific progress that began in the late 1980s. In the first era of AI research (1950s-late 1980s), researchers and patrons approached AI as a "basic" science that would advance through autonomous exploration and organic assessments of progress (e.g., peer-review, theoretical consensus). The failure of this approach led to a retrenchment of funding in the 1980s. Amid this "AI Winter," an intervention by the U.S. government reoriented the field towards measurable progress on tasks of military and commercial interest. A new evaluation system called "benchmarking" provided an objective way to quantify progress on tasks by focusing exclusively on increasing predictive accuracy on example datasets. Distilling science down to verifiable metrics clarified the roles of scientists, allowed the field to rapidly integrate talent, and provided clear signals of significance and progress. But history has also revealed a tradeoff to this streamlined approach to science: the consolidation around external interests and inherent conservatism of benchmarking has disincentivized exploration beyond scaling monoculture. In the discussion, we explain how AI's monoculture offers a compelling challenge to the belief that basic, exploration-driven research is needed for scientific progress. Implications for the spread of AI monoculture to other sciences in the era of generative AI are also discussed.

Paper Structure (25 sections, 3 figures)

Figures (3)

  • Figure 1: The process of benchmarking. Top Left: A “benchmark” consists of a task that is part of a larger problem, a dataset that is representative of that task, and a metric (usually some version of accuracy) that scientists must build an algorithm to maximize. Top Right: Scientists then set their algorithms to compete on the benchmark. The algorithm with the lowest error (highest accuracy) wins the grant. Bottom Left: A typical benchmarking table that appears in a machine learning paper. Authors bold their algorithm’s scores to highlight that they achieved state-of-the-art accuracy/error scores. Bottom Right: Hypothetical benchmarking curve for a task community over time. Gradual lowering of the state-of-the-art error score is “normal science” in Kuhnian terms. A large jump in the state of the art suggests a significant innovation.
  • Figure 2: Learning Curve (left) and Benchmarking Table (right). Learning curves, which depict the tradeoff between learning efficiency and accuracy, have been largely replaced by accuracy-only benchmarking tables in machine learning research (MLR).
  • Figure 3: Estimated rates of publication in AI on different machine learning techniques from 1993 to 2018. Shaded areas represent credible intervals and dots represent statistically significant rate shifts. Deep learning explodes in 2013, while research on other methods remains stagnant.
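
The benchmarking process described in the Figure 1 caption (a task, a representative dataset, a single metric, and algorithms ranked by error) can be illustrated with a minimal sketch. Everything here is a hypothetical toy and not taken from the paper: the `Benchmark` class, the `leaderboard` helper, the parity task, and the three entrant "algorithms" are assumptions chosen only to show how a single error metric produces a ranked table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Toy example format (an assumption for illustration): (input, true label).
Example = Tuple[int, int]


@dataclass
class Benchmark:
    """A task plus a dataset; the metric here is the plain error rate."""
    task: str
    dataset: Sequence[Example]

    def error(self, predict: Callable[[int], int]) -> float:
        # Fraction of examples the algorithm labels incorrectly.
        wrong = sum(1 for x, y in self.dataset if predict(x) != y)
        return wrong / len(self.dataset)


def leaderboard(
    benchmark: Benchmark, entrants: Dict[str, Callable[[int], int]]
) -> List[Tuple[str, float]]:
    """Score every entrant on the benchmark and sort by lowest error."""
    scores = [(name, benchmark.error(fn)) for name, fn in entrants.items()]
    return sorted(scores, key=lambda row: row[1])


# Hypothetical task: decide whether an integer is odd.
parity = Benchmark(
    task="is the number odd?",
    dataset=[(n, n % 2) for n in range(10)],
)

# Hypothetical competing "algorithms".
entrants = {
    "always_zero": lambda n: 0,
    "modulo": lambda n: n % 2,
    "threshold": lambda n: int(n >= 5),
}

table = leaderboard(parity, entrants)
for name, err in table:
    print(f"{name:12s} error={err:.2f}")
```

The first row of `table` plays the role of the bolded state-of-the-art entry in a benchmarking table. Note that the ranking records only error on the example dataset; nothing about learning efficiency, explainability, or cost enters the comparison, which is the narrowing the caption for Figure 2 points to.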