CiteME: Can Language Models Accurately Cite Scientific Claims?

Ori Press; Andreas Hochlehnert; Ameya Prabhu; Vishaal Udandarao; Ofir Press; Matthias Bethge

TL;DR

This paper introduces CiteME, a manually curated benchmark for open-ended citation attribution in which each text excerpt must be linked to the single paper it cites. It demonstrates a large gap between human performance (69.7%) and current language models (4.2–18.5%), and presents CiteAgent, a GPT-4o-based autonomous agent that searches for and reads papers in real time to attribute citations, achieving 35.3% accuracy. The work highlights the challenges of automatic claim attribution in scientific text, evaluates the limits of retrieval-augmented LM systems, and suggests a path toward verifiable LM-generated content in scholarly contexts. Overall, CiteME provides a rigorous testbed for measuring and improving automated verification of LM-generated scientific claims.

Abstract

Thousands of new scientific papers are published each month. Such information overload complicates researchers' efforts to stay current with the state of the art as well as to verify and correctly attribute claims. We pose the following research question: Given a text excerpt referencing a paper, could an LM act as a research assistant to correctly identify the referenced paper? We advance efforts to answer this question by building a benchmark that evaluates the abilities of LMs in citation attribution. Our benchmark, CiteME, consists of text excerpts from recent machine learning papers, each referencing a single other paper. Evaluation on CiteME reveals a large gap between frontier LMs and human performance, with LMs achieving only 4.2–18.5% accuracy and humans 69.7%. We partially close this gap by introducing CiteAgent, an autonomous system built on the GPT-4o LM that can also search and read papers, which achieves an accuracy of 35.3% on CiteME. Overall, CiteME serves as a challenging testbed for open-ended claim attribution, driving the research community towards a future where any claim made by an LM can be automatically verified and discarded if found to be incorrect.

Paper Structure (22 sections, 18 figures, 8 tables)

Introduction
The CiteME Benchmark
CiteAgent
Experiment Setup
Results
Error Analysis
Analyzing the Successful Runs
Benchmarking Reasoning Capability Improvements with Latest Models
Related Work
Conclusion
Excerpts from Citation Datasets
ACL-200
RefSeer
arXiv
FullTextPeerRead (Jeong et al., 2020)
...and 7 more sections

Figures (18)

Figure 1: Example of a CiteME instance. The input (left) is an excerpt from a published paper with an anonymized citation; the target answer (right) is the title of the cited paper.
Figure 2: (Left) The top 10 most frequent labels of papers in CiteME, as identified by GPT-4. Overly broad tags like "Machine Learning" or "Deep Networks" were excluded (see the appendix for details). (Right) Most excerpts in CiteME are from recent papers.
Figure 3: The demonstration trajectory we gave CiteAgent in the prompt.
Figure 4: Five CiteAgent trajectories on five different samples. CiteAgent often exhibits behavior not shown in the demonstration given in the prompt, for example: searching by citation count and then by relevance, and searching multiple times in a row. Gray dotted box: prompt demonstration; green dotted boxes: CiteAgent succeeds; red dotted boxes: CiteAgent fails.
Figure 5: CiteAgent trajectories on correctly predicted samples reveal differences in model behavior. GPT-4o reads more frequently than Claude 3 Opus and can correctly predict papers even after performing many actions.
...and 13 more figures
