CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

Jiefu Ou, William Gantt Walden, Kate Sanders, Zhengping Jiang, Kaiser Sun, Jeffrey Cheng, William Jurayj, Miriam Wanner, Shaobo Liang, Candice Morgan, Seunghoon Han, Weiqi Wang, Chandler May, Hannah Recknor, Daniel Khashabi, Benjamin Van Durme

TL;DR

ClaimCheck introduces a claim-grounded peer-review dataset built from rejected NeurIPS 2023/2024 submissions and their reviews. It defines three claim-centric tasks, Claim Association (CA), Weakness Labeling and Editing (WLE), and Claim Verification (CV), and benchmarks multimodal LLMs on them. Across CA, WLE, and CV, current models struggle to ground reviewer weaknesses to precise claims and to verify claims with grounded reasoning, though they can assist with fine-grained labeling and editing under supervision. The work provides a dataset and an evaluation suite to spur progress toward more trustworthy, claim-grounded AI-assisted peer review, and it discusses limitations and ethical considerations.

Abstract

A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.

Paper Structure

This paper contains 42 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: ClaimCheck is sourced from rejected NeurIPS submissions (a) and their corresponding reviews from OpenReview (b). Annotations identify claim-related weaknesses in reviews (blue arrows to Weakness 1 & 2) and provide fine-grained labels (c). Weaknesses are grounded to the specific target claims (Claim 27 & 1, respectively) that they dispute (black arrows from (c) to (d)). These target claims are identified from a set of claims extracted from the original paper (purple arrows from (a) to (d)). Grounding weaknesses in a paper's claims is essential in peer review.
  • Figure 2: Distribution of the various weakness labels for ClaimCheck: groundedness confidence scores (top left), weakness types (top right), subjectivity scores (bottom left), and agreement scores (bottom right). See §\ref{sec:data::annotation}.
  • Figure 3: The results of the Claim Association (CA) task (§\ref{sec:experiments::claim-association}). Left: Avg. pairwise $\text{F}_{1,\text{edit}}$ and $\text{F}_{1,\text{exact}}$ (see §\ref{sec:data::annotation}) between (1) human annotators (Humans Only) and (2) each model and all humans on the Pilot data. Right: Avg. model $\text{F}_1$ w.r.t. a single human annotation on the Main data. On the Pilot data, all LLMs show lower performance than the expert (human) average. On the Main data, o1 achieves the highest $\text{F}_1$ scores, but the low absolute scores of all models indicate that the evaluated LLMs all struggle to ground weaknesses to claims.
  • Figure 4: Annotation interface for the Weakness Identification (WI) subtask. Annotators select contiguous spans from the review text (top left), each describing a weakness raised by the reviewer. For each weakness, annotators supply a Likert-scale judgment (top right) indicating the extent to which they believe the weakness targets a specific claim made in the paper (bottom left). Annotators select as many weaknesses as they can find in the review that plausibly target some claim. The paper in this example (and in Figures \ref{fig:claim-association-part-1}-\ref{fig:claim-association-part-3}) is jiang-etal-2023-accelerating.
  • Figure 5: Annotation interface showing part of the Claim Association (CA) subtask. Given (1) the weaknesses identified for a given review during the Weakness Identification (WI) subtask (Figure \ref{fig:weakness-identification}) and (2) a set of candidate claims extracted by GPT-4o, annotators must determine which of these claims, if any, are targeted by each weakness. Although we also asked annotators to provide type labels for each candidate target claim during annotation, we found that these labels do not provide necessary information for the other annotation subtasks or for LLM reasoning, so we dropped them from the final dataset and evaluation.
  • ...and 2 more figures
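
The Figure 3 caption reports average pairwise F1 between annotators and between models and annotators on the Claim Association task. As a rough illustration only, the sketch below computes an exact-match, set-based F1 over the claim IDs two annotators associate with one weakness, then averages it over all annotator pairs. The function names, the representation of claims as integer IDs, and the convention that two empty sets score 1.0 are assumptions made here for illustration; this excerpt does not define the paper's exact $\text{F}_{1,\text{exact}}$ or $\text{F}_{1,\text{edit}}$ metrics, and the edit-based variant is not modeled at all.

```python
from itertools import combinations


def set_f1(predicted, reference):
    """Exact-match F1 between two sets of claim IDs (hypothetical helper)."""
    predicted, reference = set(predicted), set(reference)
    # Assumed convention: two annotators who both select nothing agree fully.
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def avg_pairwise_f1(annotations):
    """Mean set F1 over all unordered pairs of annotations.

    Set F1 is symmetric in its arguments, so unordered pairs are sufficient.
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotations to form a pair")
    return sum(set_f1(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical annotators linking one weakness to candidate claims.
annotator_claims = [{27, 1}, {27}, {1, 27, 3}]
print(f"avg pairwise F1: {avg_pairwise_f1(annotator_claims):.3f}")
```

To score a model against human annotators under the same assumptions, one could call `set_f1(model_claims, human_claims)` for each human annotation and average the results.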