QRA++: Quantified Reproducibility Assessment for Common Types of Results in Natural Language Processing
Anya Belz
TL;DR
QRA++ addresses the lack of a quantitative basis for assessing reproducibility in NLP by introducing a metrology-informed framework that yields continuous-valued reproducibility assessments, comparable across studies, at three levels of granularity. It defines a standard set of experiment properties, four result types, and corresponding measures ($CV^*$, $r$, $\rho$, $\tau$, $W$, $\kappa$, $\alpha$, and $P$) for comparing original and reproduced experiments. Through three illustrative examples, the paper shows that reproducibility depends on experiment similarity, evaluation method, and system type, and that QC-level analyses often reveal more informative patterns than system-level scores. The framework supports diagnosing the causes of reproducibility gaps and provides a structured template for reporting reproducibility across NLP studies, with practical implications for experimental rigor and comparability.
Abstract
Reproduction studies reported in NLP provide individual data points which in combination indicate worryingly low levels of reproducibility in the field. Because each reproduction study reports quantitative conclusions based on its own, often not explicitly stated, criteria for reproduction success/failure, the conclusions drawn are hard to interpret, compare, and learn from. In this paper, we present QRA++, a quantitative approach to reproducibility assessment that (i) produces continuous-valued degree-of-reproducibility assessments at three levels of granularity; (ii) utilises reproducibility measures that are directly comparable across different studies; and (iii) grounds expectations about degree of reproducibility in degree of similarity between experiments. QRA++ enables more informative reproducibility assessments to be conducted, and conclusions to be drawn about what causes reproducibility to be better or poorer. We illustrate this by applying QRA++ to three example sets of comparable experiments, revealing clear evidence that degree of reproducibility depends on similarity of experiment properties, but also on system type and evaluation method.
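
To make the idea of a continuous-valued, cross-study-comparable measure concrete, the following is a minimal sketch of how a measure such as $CV^*$ could be computed for one score across an original experiment and its reproductions. It assumes that $CV^*$ denotes the coefficient of variation with the standard small-sample correction factor $(1 + 1/(4n))$, computed from the unbiased sample standard deviation and expressed as a percentage; this section does not spell out the formula, so that reading is an assumption. The function name `cv_star` and the example scores are hypothetical.

```python
import statistics


def cv_star(scores):
    """Small-sample-corrected coefficient of variation, in percent.

    Assumed definition (not stated in this section):
        CV* = (1 + 1/(4n)) * 100 * s / |mean|
    where s is the unbiased (n - 1) sample standard deviation.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to measure variation")
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    sd = statistics.stdev(scores)      # sample standard deviation (n - 1)
    cv = 100.0 * sd / abs(mean)        # uncorrected CV as a percentage
    return (1.0 + 1.0 / (4.0 * n)) * cv  # small-sample correction


# Hypothetical scores: one original result and two reproductions
# of the same system under the same metric (made-up numbers).
example_scores = [27.3, 26.1, 28.0]
print(f"CV* = {cv_star(example_scores):.2f}")
```

Because the result is unitless and normalised by the mean, values from different studies can be placed side by side, which is the property the abstract refers to as direct comparability; lower values indicate closer agreement between the original and reproduced scores.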
