Containing the Reproducibility Gap: Automated Repository-Level Containerization for Scholarly Jupyter Notebooks

Sheeba Samuel, Daniel Mietchen, Hemanta Lo, Martin Gaedke

Abstract

Computational reproducibility is fundamental to trustworthy science, yet it remains difficult to achieve in practice across various research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies, and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated execution to approximate each notebook's original computational context. We evaluate the approach on 443 notebooks from 116 GitHub repositories referenced by publications in PubMed Central. Execution outcomes are classified into four categories: resolved environment failures, persistent logic or data errors, reproducibility drift, and container-induced regressions. Our results show that containerization resolves 66.7% of prior dependency-related failures and substantially improves execution robustness. However, a significant reproducibility gap remains: 53.7% of notebooks exhibit low output fidelity, largely due to persistent runtime failures and stochastic non-determinism. These findings indicate that standardized containerization is essential for computational stability but insufficient for full bit-wise reproducibility. The framework offers a scalable solution for researchers, editors, and archivists seeking systematic, automated assessment of computational artifacts.
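
To make the three pipeline stages named above concrete, the following minimal Python sketch shows one way dependency inference, container generation, and isolated execution could be wired together. It is an illustration under stated assumptions, not the authors' implementation: the file names requirements.inferred.txt and Dockerfile.repro, the base image, and the image tag are chosen only for the example.

# Minimal sketch of the three pipeline stages: dependency inference,
# container generation, and isolated execution. Illustrative only;
# file names and the image tag are assumptions for this example.
import subprocess
from pathlib import Path


def infer_requirements(repo: Path) -> str:
    """Use the declared requirements.txt when present; otherwise fall
    back to a minimal environment so the notebook kernel can start."""
    declared = repo / "requirements.txt"
    if declared.exists():
        return declared.read_text()
    return "jupyter\n"  # assumption: conservative fallback


def write_build_context(repo: Path) -> None:
    """Materialize the inferred dependencies and a Dockerfile that
    reconstructs the repository-level execution environment."""
    (repo / "requirements.inferred.txt").write_text(infer_requirements(repo))
    (repo / "Dockerfile.repro").write_text(
        "FROM python:3.10-slim\n"
        "WORKDIR /repo\n"
        "COPY . /repo\n"
        "RUN pip install --no-cache-dir jupyter nbconvert "
        "-r requirements.inferred.txt\n"
    )


def execute_notebooks(repo: Path, tag: str = "repro-eval") -> None:
    """Build the container image and re-execute every notebook inside it."""
    subprocess.run(
        ["docker", "build", "-f", str(repo / "Dockerfile.repro"),
         "-t", tag, str(repo)],
        check=True,
    )
    for nb in repo.rglob("*.ipynb"):
        rel = nb.relative_to(repo)
        # check=False: record per-notebook failures instead of aborting the sweep
        subprocess.run(
            ["docker", "run", "--rm", tag,
             "jupyter", "nbconvert", "--to", "notebook", "--execute",
             "--output", f"executed_{nb.name}", str(rel)],
            check=False,
        )

A full pipeline would additionally pin package versions, capture execution logs for the error classification described above, and compare re-executed cell outputs against the stored ones to derive output-fidelity measures such as the Reproducibility Scores reported in Figure 4.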

Paper Structure

This paper contains 11 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Modified workflow from Samuel and Mietchen (2024), adapted to incorporate automated environment reconstruction and containerized execution. Elements in dark blue signify newly integrated components, while light blue elements represent the original, unmodified process steps.
  • Figure 2: Comparison of error types in the baseline and containerized pipelines
  • Figure 3: Repository-level success rate of reproducing the results of the original notebooks, by requirements status, for both pipelines
  • Figure 4: Analysis of output fidelity for the containerized pipeline: (a) distribution of Reproducibility Scores for notebooks with vs. without requirements.txt files; (b) classification of notebooks by Reproducibility Score categories.