Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

Nils Dycke, Iryna Gurevych

TL;DR

This paper assesses whether automatic review generators (ARGs) can detect faulty research logic in scientific papers. It introduces a fully automatic counterfactual evaluation framework that formalizes paper soundness via a hierarchical research logic model, then generates soundness-critical and soundness-neutral counterfactuals and measures their impact on automated reviews. Across 133 AI/NLP papers, which yield 931 counterfactuals, the study finds that faults in reasoning do not significantly alter the content or sentiment of ARG-generated reviews, which calls into question the practical reliability of current ARGs for peer review. The authors release the counterfactual dataset and framework and offer recommendations emphasizing skill-specific evaluation, human–AI collaboration, and improved assessment practices for robust, trustworthy automated reviewing.

Abstract

Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.

Paper Structure

This paper contains 61 sections, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: A paper's research logic. Each building block is evidenced in the paper (dotted lines), and the building blocks on one level jointly entail the building block above them in the hierarchy (arrows).
  • Figure 2: The counterfactual evaluation pipeline takes the original paper as an input and results in the evaluator's output. RL = research logic; CF = counterfactual; ARG = automatic review generator.
  • Figure 3: Average treatment effect (ATE) of aspect differences for soundness-critical and soundness-neutral CFs; a larger gap between the neutral and critical means (dashed horizontal lines) is better.
  • Figure 4: ATE of sentiment differences for soundness-critical and soundness-neutral CFs; a larger gap between the neutral and critical means (dashed horizontal lines) is better.
  • Figure 5: ATE of score differences for soundness-critical and soundness-neutral CFs; a larger gap between the neutral and critical means (dashed horizontal lines) is better.
  • ...and 3 more figures
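
The comparison described in the captions of Figures 3 to 5 can be illustrated with a minimal sketch. It assumes that ATE stands for average treatment effect and that the effect is the mean per-paper change in a review measurement (here, an overall score) between the review of the original paper and the review of its counterfactual. The function name, variable names, and all numbers are hypothetical and are not taken from the paper or its released framework.

```python
from statistics import mean


def average_treatment_effect(original, counterfactual):
    """Mean per-paper change in a review measurement (counterfactual minus original).

    Assumption: both lists are paired by paper, i.e. index i refers to the
    same paper in both lists.
    """
    if len(original) != len(counterfactual):
        raise ValueError("paired lists must have the same length")
    return mean(cf - orig for orig, cf in zip(original, counterfactual))


# Hypothetical review scores for four papers (illustrative only, not study data).
original_scores = [6.0, 5.0, 7.0, 6.0]
critical_cf_scores = [4.0, 4.0, 5.0, 5.0]  # reviews of soundness-critical CFs
neutral_cf_scores = [6.0, 5.0, 7.0, 6.0]   # reviews of soundness-neutral CFs

ate_critical = average_treatment_effect(original_scores, critical_cf_scores)
ate_neutral = average_treatment_effect(original_scores, neutral_cf_scores)

# Gap between the neutral and critical means; larger means the reviewer
# reacts to broken research logic but not to harmless edits.
gap = ate_neutral - ate_critical

print(f"ATE (critical): {ate_critical}")
print(f"ATE (neutral):  {ate_neutral}")
print(f"Gap:            {gap}")
```

In this toy setup the reviewer lowers its scores only for the soundness-critical counterfactuals, so the gap is large. The study reports the opposite pattern for the ARGs it tested: flaws in research logic had no significant effect on the generated reviews.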