Assessing Reproducibility in Evolutionary Computation: A Case Study using Human- and LLM-based Assessment
Francesca Da Ros, Tarik Začiragić, Aske Plaat, Thomas Bäck, Niki van Stein
TL;DR
This study scrutinizes reproducibility practices in evolutionary computation by applying a structured reproducibility checklist to 168 full papers from GECCO's Evolutionary Combinatorial Optimization and Metaheuristics track (2016–2025) and introducing RECAP, an LLM-based pipeline that automates reproducibility assessment. It finds a moderate overall completeness ($\approx 0.62$) and relatively low artifact sharing ($\approx 37\%$), with automated assessments achieving substantial agreement with human judgments ($\kappa\approx 0.67$). The work highlights persistent gaps in reporting operational details, such as artifact availability and random seeds, and demonstrates the feasibility and limitations of using automation to monitor reproducibility at scale. Practical recommendations emphasize persistent artifact hosting, checklist-guided reporting, and human-in-the-loop review to enhance reproducibility in empirical evolutionary computation research.
Abstract
Reproducibility is an important requirement in evolutionary computation, where results largely depend on computational experiments. In practice, reproducibility relies on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual reproducibility levels of published work in the field. In this paper, we study reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62, and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen's $\kappa$ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.
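To make the two headline metrics concrete, the following is a minimal Python sketch, not the authors' implementation. Cohen's $\kappa$ follows its standard definition, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The completeness function is an assumption: this section does not state the scoring rule, so the sketch takes completeness to be the fraction of checklist items marked as satisfied. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both raters use one identical label;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def completeness(answers):
    """Fraction of checklist items satisfied (ASSUMED definition)."""
    return sum(answers) / len(answers)


# Hypothetical toy data: one checklist item judged on eight papers.
human = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
llm = ["yes", "yes", "no", "yes", "yes", "no", "no", "no"]

print(cohens_kappa(human, llm))                  # 0.5
print(completeness([True, True, False, True]))   # 0.75
```

In the toy data the raters agree on six of eight items ($p_o = 0.75$) and each uses both labels equally often ($p_e = 0.5$), which gives $\kappa = 0.5$. On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement, which is the band the reported 0.67 falls into.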
