Assessing Reproducibility in Evolutionary Computation: A Case Study using Human- and LLM-based Assessment

Francesca Da Ros; Tarik Začiragić; Aske Plaat; Thomas Bäck; Niki van Stein

TL;DR

This study examines reproducibility practices in evolutionary computation by applying a structured reproducibility checklist to 168 full papers from GECCO's Evolutionary Combinatorial Optimization and Metaheuristics track (2016–2025) and by introducing RECAP, an LLM-based pipeline that automates reproducibility assessment. It finds moderate overall completeness ($\approx 0.62$) and relatively low artifact sharing ($\approx 37\%$), with automated assessments reaching substantial agreement with human judgments ($\kappa\approx 0.67$). The work highlights persistent gaps in the reporting of operational details, such as artifact availability and random seeds, and demonstrates both the feasibility and the limitations of using automation to monitor reproducibility at scale. Practical recommendations emphasize persistent artifact hosting, checklist-guided reporting, and human-in-the-loop review to improve reproducibility in empirical evolutionary computation research.

Abstract

Reproducibility is an important requirement in evolutionary computation, where results largely depend on computational experiments. In practice, reproducibility relies on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual reproducibility levels of published work in the field. In this paper, we study the reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62, and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen's $\kappa$ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.
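To make the two headline quantities concrete, the sketch below computes Cohen's $\kappa$ between a human and an automated rater, along with a per-paper completeness score. It is a minimal illustration, not the authors' code: the function names are hypothetical, the labels are invented toy data, and the completeness definition (share of Y among items not marked NA) is an assumption, since the abstract does not spell out the formula.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; agreement is trivial.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def completeness(labels):
    """ASSUMED definition: share of 'Y' among applicable (non-'NA') items."""
    applicable = [label for label in labels if label != "NA"]
    if not applicable:
        return None  # no applicable checklist items for this paper
    return sum(label == "Y" for label in applicable) / len(applicable)


# Toy checklist outcomes, invented for illustration only (not data from the paper).
human = ["Y", "Y", "N", "NA", "Y", "N", "N", "Y"]
auto = ["Y", "N", "N", "NA", "Y", "N", "Y", "Y"]

kappa = cohens_kappa(human, auto)
score = completeness(human)
print(f"kappa={kappa:.3f}, completeness={score:.3f}")
```

Unlike raw percentage agreement, $\kappa$ discounts the agreement two raters would reach by chance given how often each uses every label, which is why it is the usual choice for comparing an automated assessor against human judgments.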

Paper Structure (27 sections, 13 figures, 2 tables)

This paper contains 27 sections, 13 figures, 2 tables.

Introduction
Related Work
Methodology
Material Collection
Reproducibility Checklist
Manual Assessment Protocol
RECAP: LLM-based Assessment Protocol
Results
Reproducibility Assessment
Paper-level analysis
Per-item analysis
Artifact analysis
Best paper candidates
Manual vs. Automated Assessment
Discussion
...and 12 more sections

Figures (13)

Figure 1: Model of the manual assessment of a paper.
Figure 2: RECAP system overview. The system processes each paper through a field-by-field evaluation loop. Based on field type, it either uses the paper text directly (Std), retrieves cached best paper website data (BP), or processes linked repositories (Art). Each field is evaluated by an LLM with the appropriate context, and some fields invoke tools (e.g., sandbox execution). Results are parsed and stored per paper.
Figure 3: Paper-level completeness of reproducibility reporting across years. Individual papers are shown as points, while boxplots indicate quartiles and lines identify yearly trends.
Figure 4: Reporting rate (left) per item and outcomes (Y, N, and NA) across papers (right).
Figure 5: Proportion of papers providing some additional material other than the paper itself (supplementary material or external artifact) over time.
...and 8 more figures
