Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications
Ethan Lin, Zhiyuan Peng, Yi Fang
TL;DR
This work addresses a gap in evaluating how well LLMs assess novelty in scholarly publications by introducing SchNovel, a benchmark of 15,000 arXiv paper pairs across six fields, with the papers in each pair published 2–10 years apart. It also proposes RAG-Novelty, a retrieval-augmented approach combined with a self-reflection prompting strategy, which improves novelty judgments by incorporating recent related work. In experiments that vary field, start year, year gap, and metadata, the authors show that RAG-Novelty outperforms strong baselines, and they identify domain biases and metadata effects that warrant mitigation. The findings suggest practical pathways toward more accurate, scalable assessment of scholarly novelty and point to future work on expanding the data, understanding which textual components best encode novelty, and addressing biases in evaluation.
Abstract
Recent studies have evaluated the creativity and novelty of large language models (LLMs) primarily from a semantic perspective, using benchmarks from cognitive science. However, assessing novelty in scholarly publications remains a largely unexplored area in evaluating LLMs. In this paper, we introduce a scholarly novelty benchmark (SchNovel) to evaluate LLMs' ability to assess novelty in scholarly papers. SchNovel consists of 15,000 pairs of papers across six fields, sampled from the arXiv dataset, with the two papers in each pair published 2 to 10 years apart. In each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, which simulates the review process followed by human reviewers by retrieving similar papers to inform the novelty assessment. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models.
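The pairwise setup described in the abstract can be illustrated with a minimal Python sketch. It assumes that a judge is any function that takes two papers and returns the index (0 or 1) of the one it considers more novel, and that performance is scored as plain accuracy against the "more recent is more novel" label. The names Paper, PaperPair, pairwise_accuracy, and longer_abstract_judge are hypothetical and are not taken from the SchNovel release; the toy judge merely stands in for an LLM call, and the actual prompts, retrieval step, and metrics may differ.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Paper:
    title: str
    abstract: str
    year: int


@dataclass(frozen=True)
class PaperPair:
    first: Paper
    second: Paper

    def label(self) -> int:
        """Index (0 or 1) of the more recent paper, assumed to be more novel.

        Ties are not handled specially because benchmark pairs are
        described as being published 2 to 10 years apart.
        """
        return 0 if self.first.year > self.second.year else 1


# A judge returns 0 if it considers the first paper more novel, else 1.
# In practice this would wrap an LLM prompt; here it is a plain callable.
Judge = Callable[[Paper, Paper], int]


def pairwise_accuracy(pairs: Sequence[PaperPair], judge: Judge) -> float:
    """Fraction of pairs where the judge picks the more recent paper."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(1 for p in pairs if judge(p.first, p.second) == p.label())
    return correct / len(pairs)


def longer_abstract_judge(a: Paper, b: Paper) -> int:
    """Toy stand-in for an LLM judge: prefers the longer abstract."""
    return 0 if len(a.abstract) > len(b.abstract) else 1


if __name__ == "__main__":
    older = Paper("Older paper", "short text", 2015)
    newer = Paper("Newer paper", "a considerably longer abstract", 2020)
    # Present each pair in both orders so a position-biased judge is exposed.
    pairs = [PaperPair(older, newer), PaperPair(newer, older)]
    print(pairwise_accuracy(pairs, longer_abstract_judge))
```

Presenting each pair in both orders is a design choice of this sketch, not a documented part of the benchmark: a judge that always answers "first" then scores exactly 0.5, which makes position bias easy to spot.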
