CiteME: Can Language Models Accurately Cite Scientific Claims?
Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, Matthias Bethge
TL;DR
This paper introduces CiteME, a manually curated benchmark for open-ended citation attribution in which each text excerpt must be linked to the single paper it cites. It demonstrates a large gap between human performance (69.7%) and current language models (4.2–18.5%), and presents CiteAgent, a GPT-4o-based autonomous agent that uses real-time search and paper reading to attribute citations, achieving 35.3% accuracy. The work highlights the challenges of automatic claim attribution in scientific text, evaluates the limits of retrieval-augmented LM systems, and suggests a path toward verifiable LM-generated content in scholarly contexts. Overall, CiteME provides a rigorous testbed for measuring and improving automated verification of scientific claims made by LMs.
Abstract
Thousands of new scientific papers are published each month. Such information overload complicates researchers' efforts to stay current with the state of the art as well as to verify and correctly attribute claims. We pose the following research question: Given a text excerpt referencing a paper, could an LM act as a research assistant to correctly identify the referenced paper? We advance efforts to answer this question by building a benchmark that evaluates the abilities of LMs in citation attribution. Our benchmark, CiteME, consists of text excerpts from recent machine learning papers, each referencing a single other paper. Evaluation on CiteME reveals a large gap between frontier LMs and human performance, with LMs achieving only 4.2–18.5% accuracy and humans 69.7%. We narrow this gap by introducing CiteAgent, an autonomous system built on the GPT-4o LM that can also search and read papers, which achieves an accuracy of 35.3% on CiteME. Overall, CiteME serves as a challenging testbed for open-ended claim attribution, driving the research community towards a future where any claim made by an LM can be automatically verified and discarded if found to be incorrect.
