ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman, Chenghua Lin
TL;DR
This work presents a partial reproduction of Generating Fact Checking Explanations within the ReproHum framework to assess the reproducibility of human evaluation in NLP. It focuses on the Coverage criterion, covering 40 inputs, each with a gold-standard explanation and two model-generated explanations, ranked by three evaluators. The findings show patterns similar to those of the original study, with the Explain-MT model outperforming Explain-Extr and the gold explanations often performing strongly, and a significant but imperfect correlation with the original results (Spearman ρ = 0.524, Pearson r = 0.541, p < 0.01). Overall, the reproduction supports the original conclusions about the efficacy of jointly trained explanation generation, while highlighting variability due to small evaluator panels and the focus on a single evaluation criterion. These results underscore the value of reproduction efforts for validating NLP explainability studies and the ongoing need for robust human evaluation methodologies.
Abstract
This paper presents a partial reproduction of Generating Fact Checking Explanations by Atanasova et al. (2020) as part of the ReproHum element of the ReproNLP shared task, which sets out to reproduce the human evaluation findings of NLP research. This shared task aims to investigate the extent to which NLP as a field is becoming more or less reproducible over time. Following the instructions provided by the task organisers and the original authors, we collect relative rankings of 3 fact-checking explanations (comprising a gold standard and the outputs of 2 models) for 40 inputs on the criterion of Coverage. The results of our reproduction, together with our reanalysis of the original work's raw results, lend support to the original findings, with similar patterns seen between the original work and our reproduction. Whilst we observe slight variation from the original results, our findings support the main conclusions drawn by the original authors pertaining to the efficacy of their proposed models.
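
Agreement between an original study and its reproduction is commonly summarised with Pearson's r and Spearman's ρ, the two coefficients reported for this reproduction. The following is a minimal sketch of how such coefficients can be computed, not the analysis script used here: the function names and the two score lists are hypothetical and purely illustrative, and significance testing (the p-value) is omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean ranks per item (NOT data from either study).
original = [1.5, 2.0, 2.5, 1.0, 3.0, 2.0]
reproduction = [1.0, 2.5, 2.0, 1.5, 3.0, 2.5]

print("Pearson r:", round(pearson(original, reproduction), 3))
print("Spearman rho:", round(spearman(original, reproduction), 3))
```

Spearman's ρ depends only on the ordering of the scores, whereas Pearson's r is also sensitive to their magnitudes, which is one reason for reporting both when comparing ranking-based evaluations.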
