A Dataset For Computational Reproducibility
Lázaro Costa, Susana Barbosa, Jácome Cunha
TL;DR
The paper addresses the reproducibility crisis in computational science by constructing a curated, cross-domain dataset of computational experiments with explicit dependencies and execution details to benchmark reproducibility tools. It collects 38 experiments from diverse domains, characterizes them, and evaluates reproducibility using eight tools, revealing a 47 percent success rate and highlighting gaps in documentation and environment compatibility. The curated dataset of 18 reproducible experiments, along with reproducibility packages and ACM artifact classifications, provides a practical benchmark for tool evaluation and encourages standardized documentation practices. Overall, the work underscores the need for more robust, adaptable reproducibility tools and better documentation to enhance transparency and reliability in computational research.
Abstract
Ensuring the reproducibility of scientific work is crucial, as it allows scientific claims to be verified consistently and advances knowledge by providing a reliable foundation for future research. However, scientific work based on computational artifacts, such as scripts for statistical analysis or software prototypes, faces significant challenges in achieving reproducibility. These challenges stem from the variability of computational environments, rapid software evolution, and inadequate documentation of procedures. As a consequence, such artifacts are often not (easily) reproducible, undermining the credibility of scientific findings. Evaluating reproducibility approaches, and tools in particular, is challenging in many respects, one being the need to test them on suitable inputs, in this case computational experiments. This article therefore introduces a curated dataset of computational experiments covering a broad spectrum of scientific fields, incorporating the details about software dependencies, execution steps, and configurations necessary for accurate reproduction. The dataset is structured to reflect diverse computational requirements and methodologies, ranging from simple scripts to complex, multi-language workflows, so that it represents the wide range of challenges researchers face in reproducing computational studies. It provides a universal benchmark by establishing a standardized dataset for objectively evaluating and comparing the effectiveness of reproducibility tools. Each experiment included in the dataset is carefully documented to ensure ease of use. We added clear instructions that follow a common standard, so every experiment is documented in the same way, making it easier for researchers to run each of them with their own reproducibility tool.
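As an illustration only, the kind of per-experiment record described above (software dependencies, execution steps, and whether reproduction succeeded) could be sketched as follows. The `ExperimentRecord` class, its field names, and the sample entry are hypothetical and do not reflect the dataset's actual schema. The only figures taken from the text are the 38 collected and 18 reproducible experiments, and the sketch assumes the reported 47 percent success rate is the ratio of the two.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical per-experiment metadata; not the dataset's real schema."""
    name: str
    domain: str
    languages: list[str]
    dependencies: list[str] = field(default_factory=list)
    run_steps: list[str] = field(default_factory=list)
    reproduced: bool = False


def success_rate(reproduced: int, total: int) -> float:
    """Return the share of reproduced experiments as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * reproduced / total


# Made-up example entry, for illustration only.
example = ExperimentRecord(
    name="example-analysis",
    domain="climate science",
    languages=["Python", "R"],
    dependencies=["numpy", "ggplot2"],
    run_steps=["python prepare.py", "Rscript plot.R"],
    reproduced=True,
)

# Figures from the summary: 18 of 38 experiments were reproducible.
rate = success_rate(18, 38)
print(f"{rate:.1f}%")  # prints 47.4%
```

Keeping the same fields for every record mirrors the idea of standardized instructions: a reproducibility tool under evaluation can read `run_steps` and `dependencies` in the same way for each experiment.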
