Leakage and the Reproducibility Crisis in ML-based Science

Sayash Kapoor, Arvind Narayanan

TL;DR

The paper investigates how data leakage undermines reproducibility in ML-based science, surveying errors found in 17 fields that collectively affect 329 papers. It introduces a taxonomy of eight types of leakage, demonstrates the impact through a reproducibility study of civil war prediction, and proposes model info sheets to detect and prevent leakage before publication. Once leakage is corrected, complex ML models do not perform substantively better than decades-old Logistic Regression models, underscoring the need for standardized reporting and reproducibility infrastructure. The work advocates interdisciplinary collaboration to implement practical safeguards and restore credibility in ML-driven scientific claims.

Abstract

The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.

Paper Structure (40 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: A comparison of reported and corrected results in civil war prediction papers published in top political science journals. The main findings of each of these papers are invalid due to various forms of data leakage: Muchlinski et al. (2016) impute the training and test data together, Colaresi and Mahmood (2017) and Wang (2019) incorrectly reuse an imputed dataset, and Kaufman et al. (2019) use proxies for the target variable, which causes data leakage. The use of model info sheets would detect leakage in every paper. When we correct these errors, complex ML models (such as AdaBoost and Random Forests) do not perform substantively better than decades-old Logistic Regression models for civil war prediction in each case. Each column in the table outlines the impact of leakage on the results of a paper. The figure above each column shows the difference in performance that results from fixing leakage issues.
  • Figure 2: Number of political science papers containing the terms "civil war" AND "machine learning" in the Dimensions database of academic research (Hook et al., 2018). Note the sharp increase in papers using ML methods in the last few years.
  • Figure A1: Distribution of the agexp variable for peace and war data points for different imputation steps in Muchlinski et al. (2016). Note that the distribution of peace instances in the test set (D) has a peak that is close to the distribution in the imputed training set (B, C), which allows the random forest model to learn the small range of values where peace data points are concentrated. While we report results for the agexp variable, similar trends appear across independent variables in the dataset.
  • Figure A2: Results of a simulation showing how imputing the training and test sets together leads to overoptimistic estimates of model performance. The 95% confidence intervals are too small to be seen.
  • Figure A3: The wide confidence intervals for sensitivities and specificities reported in Blair and Sambanis. Here, we visualize the escalation and cameo models for the 1-month and 6-month forecasts in the base specification (reported in Figure 1 of their paper).
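
The leakage pattern named in Figures 1 and A2, imputing the training and test sets together, can be illustrated with a minimal sketch. This is not the procedure used in any of the surveyed papers: simple mean imputation is swapped in for illustration, and the toy data and the helper names `fit_mean_imputer` and `apply_imputer` are hypothetical. The sketch only shows the structural difference between learning imputation statistics from all rows and learning them from the training split alone.

```python
from statistics import fmean


def fit_mean_imputer(rows):
    """Learn per-column means from `rows`, ignoring missing values (None).

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(fmean(observed))
    return means


def apply_imputer(rows, means):
    """Fill missing values (None) with previously learned column means."""
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


# Hypothetical toy data: a single feature column; None marks a missing value.
train = [[1.0], [None], [3.0]]
test = [[None], [11.0]]

# Leaky: the fill value is computed from train and test rows together,
# so information about the test set reaches the training data.
leaky_means = fit_mean_imputer(train + test)
leaky_train = apply_imputer(train, leaky_means)

# Leakage-free: the fill value is learned from the training split only
# and then reused, unchanged, on the test split.
clean_means = fit_mean_imputer(train)
clean_train = apply_imputer(train, clean_means)
clean_test = apply_imputer(test, clean_means)

print("fill value learned jointly:   ", leaky_means)
print("fill value learned from train:", clean_means)
```

In the leaky variant the value written into the training set changes whenever the test rows change, which is the dependency a held-out evaluation is supposed to rule out. How much this inflates measured performance depends on the imputation method and the data; the sketch makes no claim about the magnitude.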