Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications

Ethan Lin, Zhiyuan Peng, Yi Fang

TL;DR

This work addresses the gap in evaluating LLMs for novelty in scholarly publications by introducing SchNovel, a benchmark of 15,000 arXiv paper pairs across six fields with 2–10 year gaps. It couples a retrieval-augmented approach, RAG-Novelty, with a self-reflection prompting strategy to improve novelty judgments by incorporating recent related work. Through extensive experiments across fields, start years, year gaps, and metadata, the authors show RAG-Novelty outperforms strong baselines, while identifying domain biases and metadata effects that warrant mitigation. The findings suggest practical pathways for more accurate, scalable scholarly novelty assessment and point to future work on expanding data, understanding what textual components best encode novelty, and addressing biases in evaluation.

Abstract

Recent studies have evaluated the creativity/novelty of large language models (LLMs) primarily from a semantic perspective, using benchmarks from cognitive science. However, assessing novelty in scholarly publications remains a largely unexplored area in evaluating LLMs. In this paper, we introduce a scholarly novelty benchmark (SchNovel) to evaluate LLMs' ability to assess novelty in scholarly papers. SchNovel consists of 15,000 pairs of papers across six fields, sampled from the arXiv dataset, with publication dates 2 to 10 years apart. In each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, which simulates the review process followed by human reviewers by retrieving similar papers to assess novelty. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models.
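
The labeling rule stated in the abstract (in each pair, the more recently published paper is treated as the more novel one) can be illustrated with a minimal sketch. This is not the authors' code: the `Paper` class, `gold_label`, `pairwise_accuracy`, and the `judge` callable are hypothetical names introduced here, and the judge stands in for whatever LLM-based comparison is being evaluated.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one paper; field names are assumptions."""
    title: str
    abstract: str
    year: int


def gold_label(a: Paper, b: Paper) -> str:
    """Return 'a' or 'b' for the paper assumed more novel (the later one)."""
    if a.year == b.year:
        # The benchmark pairs papers published years apart, so a tie
        # has no label under this rule.
        raise ValueError("papers in a pair must have different years")
    return "a" if a.year > b.year else "b"


def pairwise_accuracy(
    pairs: Iterable[Tuple[Paper, Paper]],
    judge: Callable[[Paper, Paper], str],
) -> float:
    """Fraction of pairs where the judge picks the later-published paper."""
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("need at least one pair")
    correct = sum(judge(a, b) == gold_label(a, b) for a, b in pair_list)
    return correct / len(pair_list)


if __name__ == "__main__":
    old = Paper("Older work", "placeholder abstract", 2013)
    new = Paper("Newer work", "placeholder abstract", 2023)

    def always_first(a: Paper, b: Paper) -> str:
        # Toy judge that ignores content and always picks the first paper.
        return "a"

    # Presenting each pair in both orders exposes the toy judge's
    # position bias: it is right in one order and wrong in the other.
    print(pairwise_accuracy([(old, new), (new, old)], always_first))
```

Scoring each pair in both presentation orders, as in the usage lines, is one simple way to keep a position-biased judge from looking better than chance.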

Paper Structure (34 sections, 6 figures, 3 tables, 1 algorithm)

This paper contains 34 sections, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The overview of RAG-Novelty
  • Figure 2: Pointwise vs. pairwise. The metrics above were obtained in the cs field with the start year $s=2023$ and GPT-4o-mini.
  • Figure 3: Comparison of fields. The metrics above were obtained using Self-Reflection in the cs field with the start year $s=2023$ and GPT-4o-mini.
  • Figure 4: Comparison of Start Years. The metrics above were obtained using Self-Reflection in the cs field with GPT-4o-mini.
  • Figure 5: Comparison of different organizations. The metrics above were obtained using Self-Reflection in the cs field with start year $s=2023$ and GPT-4o-mini.
  • ...and 1 more figures