BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

Delip Rao, Chris Callison-Burch

Abstract

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields under a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field errors. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises 8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
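
To make the retrieval mechanism concrete, the sketch below shows the kind of pipeline the abstract describes: ask a locally running Zotero Translation Server to resolve a DOI and export it as BibTeX, falling back to the CrossRef REST API if the server fails. This is a minimal illustration, not clibib's actual implementation; the function name fetch_bibtex, the error handling, and the example DOI are assumptions for demonstration. The Zotero Translation Server endpoints (/search, /export?format=bibtex) and the CrossRef transform endpoint are the services' documented interfaces.

    import requests

    # Default port for a locally running Zotero Translation Server.
    ZTS_URL = "http://localhost:1969"

    def fetch_bibtex(doi: str, timeout: float = 10.0) -> str:
        """Resolve a DOI to a BibTeX entry via the Zotero Translation
        Server, falling back to CrossRef. Illustrative sketch only."""
        try:
            # Step 1: translate the identifier into Zotero item JSON.
            search = requests.post(
                f"{ZTS_URL}/search",
                data=doi.encode("utf-8"),
                headers={"Content-Type": "text/plain"},
                timeout=timeout,
            )
            search.raise_for_status()
            # Step 2: export the Zotero JSON as BibTeX.
            export = requests.post(
                f"{ZTS_URL}/export",
                params={"format": "bibtex"},
                json=search.json(),
                timeout=timeout,
            )
            export.raise_for_status()
            return export.text
        except requests.RequestException:
            # Fallback: CrossRef's transform endpoint returns BibTeX directly.
            resp = requests.get(
                f"https://api.crossref.org/works/{doi}/transform/application/x-bibtex",
                timeout=timeout,
            )
            resp.raise_for_status()
            return resp.text

    if __name__ == "__main__":
        # Example DOI (the BERT paper); any registered DOI works.
        print(fetch_bibtex("10.18653/v1/N19-1423"))

In the paper's two-stage integration, a record retrieved this way is handed back to the model as an authoritative reference for revising its baseline entry, rather than asking the model to search and rewrite in a single step.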

Paper Structure

This paper contains 72 sections, 21 figures, 14 tables, and 2 algorithms.

Figures (21)

  • Figure 1: Paper counts by domain and citation tier after quality filtering. Each domain initially targeted 100 popular, 100 low-citation, and 50 recent papers; final counts reflect removal of 66 non-research items during the data quality audit.
  • Figure 2: Distribution of cited-by counts across citation tiers (log scale). Popular papers have per-domain median citation counts of 500--3,567; low-citation papers cluster near 0--7; recent papers are predominantly uncited. The clear separation confirms that the tiers probe distinct regions of model familiarity.
  • Figure 3: Version composition of papers by domain. Each bar shows the proportion of papers in each version combination. Quantum Computing has the highest multi-version rate (44.8%), driven by the physics preprint culture. AI is proceedings-dominated (86.2%), while Medicine and Materials Science are almost entirely journal-only.
  • Figure 4: BibTeX retrieval outcomes by domain. Quantum Computing and Materials Science achieve >95% success. Medicine's lower rate reflects paywalled journal articles that Zotero cannot resolve, with many title-mismatch failures on fallback queries.
  • Figure 5: Canonical field coverage by domain (percentage of papers with a resolved value). Core fields (DOI, title, author, year) exceed 80% in all domains except Medicine. Booktitle is concentrated in AI (conference proceedings); journal dominates the other three domains. Medicine has the lowest coverage due to its lower BibTeX retrieval rate.
  • ...and 16 more figures