Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych

TL;DR

This paper tackles the challenge of scalable novelty assessment in peer review by proposing a three-stage, LLM-assisted pipeline that mirrors expert reviewer behavior. It combines document processing, comprehensive related-work retrieval and ranking, and a structured novelty delta analysis to produce evidence-based evaluations. Evaluated on 182 ICLR 2025 submissions with human-annotated novelty assessments, the approach achieves high alignment with human reasoning (86.5%) and novelty conclusions (75.3%), outperforming baseline AI systems. The work demonstrates that careful prompt design and literature-grounded analysis can enhance rigor and transparency in peer review without replacing human expertise.

Abstract

Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.
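
The retrieval-and-ranking stage can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes paper embeddings are already available as plain float vectors, uses cosine similarity as the embedding-similarity measure, and stands in for the LLM reranker with a hypothetical scoring callable (`score_fn`). All function names, paper identifiers, vectors, and scores below are made up for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Vector, candidates: Dict[str, Vector], top_k: int) -> List[Tuple[str, float]]:
    """First pass: keep the top_k candidates by embedding similarity to the submission."""
    scored = [(paper_id, cosine(query, vec)) for paper_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(shortlist: List[Tuple[str, float]], score_fn: Callable[[str], float]) -> List[str]:
    """Second pass: reorder the shortlist by a relevance score.

    score_fn is a hypothetical placeholder for an LLM-based relevance judgment.
    """
    ordered = sorted(shortlist, key=lambda pair: score_fn(pair[0]), reverse=True)
    return [paper_id for paper_id, _ in ordered]


# Toy data (assumed, not taken from the paper).
submission = [1.0, 0.0, 1.0]
candidates = {
    "paper_a": [1.0, 0.0, 0.9],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.5, 0.5, 0.5],
}

shortlist = retrieve(submission, candidates, top_k=2)

# Stand-in for LLM relevance scores on the shortlisted papers.
toy_relevance = {"paper_a": 0.2, "paper_c": 0.9}
ranked = rerank(shortlist, lambda paper_id: toy_relevance.get(paper_id, 0.0))

print(shortlist)
print(ranked)
```

The two-pass design keeps the expensive step small: similarity search prunes the candidate pool cheaply, and only the shortlist is passed to the costlier reranker.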

Paper Structure

This paper contains 61 sections, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Automated novelty assessment pipeline. The system processes manuscripts through three stages: (1) Document Processing extracts content using GROBID, (2) Related Work Discovery identifies and ranks relevant papers via embedding similarity and LLM reranking, and (3) Novelty Assessment performs structured analysis to generate evidence-based novelty evaluations.
  • Figure 2: Overall performance comparison between our system and three baseline systems based on human evaluation (n values indicate the number of comparisons).
  • Figure 3: Performance breakdown across evaluation categories, aggregated across all baseline comparisons.
  • Figure 4: Distribution of the number of reviews per paper. Most papers received 1 to 4 reviews.
  • Figure 5: Screenshot of the custom-built interface used for human evaluation. Annotators compared AI-generated and human-written novelty assessments across multiple dimensions, including reasoning depth, prior work engagement, and conclusion alignment.
  • ...and 9 more figures