Automated Modernization of Machine Learning Engineering Notebooks for Reproducibility

Bihui Jin; Kaiyuan Wang; Pengyu Nie

TL;DR

The paper addresses the reproducibility crisis in ML engineering notebooks caused by rapid environment evolution. It shows that only 35.4% of 12,720 Kaggle notebooks remain reproducible today and that backporting dependencies does not close the gap. The authors introduce MLEModernizer, an LLM-driven agent that iteratively executes notebooks and applies three types of targeted fixes (error-repair, runtime-reduction, and score-calibration) while treating the contemporary environment as a fixed constraint. On 7,402 non-reproducible notebooks, MLEModernizer achieves reproducibility for 5,492 (74.2%), with most repairs requiring only a few fixes and incurring modest cost (~$0.31 per notebook). The work provides a scalable path to validating and reusing MLE artifacts as software ecosystems continue to evolve, and it offers detailed analyses of fix types, error causes, and code modification scales to guide future research and tooling.

Abstract

Interactive computational notebooks (e.g., Jupyter notebooks) are widely used in machine learning engineering (MLE) to program and share end-to-end pipelines, from data preparation to model training and evaluation. However, environment erosion, the rapid evolution of hardware and software ecosystems for machine learning, has rendered many published MLE notebooks non-reproducible in contemporary environments, hindering code reuse and scientific progress. To quantify this gap, we study 12,720 notebooks mined from 79 popular Kaggle competitions: only 35.4% remain reproducible today. Crucially, we find that environment backporting, i.e., downgrading dependencies to match the submission time, does not improve reproducibility but rather introduces additional failure modes. To address environment erosion, we design and implement MLEModernizer, an LLM-driven agentic framework that treats the contemporary environment as a fixed constraint and modernizes notebook code to restore reproducibility. MLEModernizer iteratively executes notebooks, collects execution feedback, and applies three types of targeted fixes: error-repair, runtime-reduction, and score-calibration. Evaluated on 7,402 notebooks that are non-reproducible under the baseline environment, MLEModernizer makes 5,492 (74.2%) reproducible. MLEModernizer enables practitioners to validate, reuse, and maintain MLE artifacts as the hardware and software ecosystems continue to evolve.
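
The execute, collect feedback, fix cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names RunResult, pick_fix, and modernize, the order in which the three fix types are tried, the max_fixes budget, and the score_ok acceptance test are all assumptions made for illustration, and the execute and apply_fix callables stand in for the containerized runner and the LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RunResult:
    """Hypothetical summary of one notebook execution."""
    error: Optional[str]     # error message if the run failed, else None
    runtime_s: float         # wall-clock runtime in seconds
    score: Optional[float]   # reproduced score, None if none was produced


def pick_fix(result: RunResult,
             score_ok: Callable[[float], bool],
             time_limit_s: float) -> Optional[str]:
    """Choose one of the three fix types, or None if the run is acceptable.

    The priority order (errors, then runtime, then score) is an assumption.
    """
    if result.error is not None:
        return "error-repair"
    if result.runtime_s > time_limit_s:
        return "runtime-reduction"
    if result.score is None or not score_ok(result.score):
        return "score-calibration"
    return None


def modernize(code: str,
              execute: Callable[[str], RunResult],
              apply_fix: Callable[[str, str, RunResult], str],
              score_ok: Callable[[float], bool],
              time_limit_s: float,
              max_fixes: int = 5) -> Tuple[str, List[str], bool]:
    """Run the notebook, apply one fix per iteration, stop when acceptable."""
    applied: List[str] = []
    for _ in range(max_fixes + 1):
        result = execute(code)
        fix = pick_fix(result, score_ok, time_limit_s)
        if fix is None:
            return code, applied, True    # reproducible in this environment
        if len(applied) == max_fixes:
            break                         # fix budget exhausted
        code = apply_fix(code, fix, result)
        applied.append(fix)
    return code, applied, False
```

In this sketch the environment is never modified; only the notebook code changes between iterations, which mirrors the paper's choice to treat the contemporary environment as a fixed constraint.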

Paper Structure

This paper contains 46 sections, 1 equation, 13 figures, and 5 tables.

Introduction
Motivation
Reproducibility of MLE Notebooks
Dataset Collection
Competition Selection
Notebook Mining
Filtering Criteria
Execution Environment Setup
Environment and Hyper-Parameters
Containerized Execution and Runtime Measurement
Execution Outcome and Grading
Results
Environment Backporting
Backporting Algorithm
Dependency Analysis
...and 31 more sections

Figures (13)

Figure 1: An example notebook and our attempt to reproduce its results. Left: successful run on Kaggle; Right: our run failed due to a breaking change in XGBoost.
Figure 2: Distribution of notebook runtime.
Figure 3: Sankey diagram of notebook reproducibility flow from Baseline to Backporting.
Figure 4: MLEModernizer's workflow.
Figure 5: MLEModernizer successfully modernizes the notebook in Figure 1 ($\mathtt{s_r}$ = 0.93778 compared to $\mathtt{s_t}$ = 0.87511) by adapting the dataset's classes (1..7) to the ones required by the XGBoost classifier (0..6).
...and 8 more figures
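
The kind of repair described in the Figure 5 caption, shifting class labels 1..7 to the 0..6 range the classifier expects, can be sketched in plain Python. This is an illustrative example rather than the patch MLEModernizer produced: the helper names build_label_maps, encode, and decode are hypothetical, and mapping predictions back to the original labels before writing a submission is an assumption about what the notebook needs.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def build_label_maps(labels: Sequence[Hashable]) -> Tuple[Dict[Hashable, int], List[Hashable]]:
    """Map each distinct class to a contiguous index starting at 0."""
    classes = sorted(set(labels))
    to_index = {c: i for i, c in enumerate(classes)}
    return to_index, classes


def encode(labels: Sequence[Hashable], to_index: Dict[Hashable, int]) -> List[int]:
    """Convert original labels (e.g. 1..7) to zero-based indices (0..6)."""
    return [to_index[c] for c in labels]


def decode(indices: Sequence[int], classes: List[Hashable]) -> List[Hashable]:
    """Convert zero-based predictions back to the original labels."""
    return [classes[i] for i in indices]


# Training labels as they appear in the dataset (classes 1..7).
y_train = [3, 1, 7, 2, 5, 4, 6, 1]
to_index, classes = build_label_maps(y_train)
y_encoded = encode(y_train, to_index)   # labels the classifier can accept
y_restored = decode(y_encoded, classes) # original labels for the submission
```

Building the mapping from the observed labels, instead of subtracting 1, keeps the sketch valid when the original classes are not consecutive integers.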
