GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models

Zuyao Xu; Yuqi Qiu; Lu Sun; FaSheng Miao; Fubin Wu; Xinyi Wang; Xiang Li; Haozhe Lu; ZhengZe Zhang; Yuxin Hu; Jialu Li; Jin Luo; Feng Zhang; Rui Luo; Xinran Liu; Yingxian Li; Jiaji Liu

TL;DR

This work introduces CiteVerifier, an automated framework for validating citations at scale, and conducts three complementary studies to quantify ghost citations in the era of large language models. It shows that all 13 tested LLMs hallucinate citations at substantial rates (14.23%–94.93%) and that 1.07% of the papers analyzed from top-tier AI/ML and Security venues (2020–2025) already contain invalid or fabricated citations, with an 80.9% increase in 2025 alone. The authors also expose a verification gap among researchers and reviewers, where AI adoption coexists with limited inspection of references, and propose multi-stakeholder mitigations including automated DOI checks and retrieval-grounded bibliography tools. The findings point to an accelerating crisis in citation integrity and call for coordinated action from researchers, venues, and tool developers to protect the scientific record.

Abstract

Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, yet their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat and inform mitigation, we develop CiteVerifier, an open-source framework for large-scale citation verification, and conduct the first comprehensive study of citation validity in the LLM era through three experiments built on it. We benchmark 13 state-of-the-art LLMs on citation generation across 40 research domains, finding that all models hallucinate citations at rates from 14.23% to 94.93%, with significant variation across research domains. Moreover, we analyze 2.2 million citations from 56,381 papers published at top-tier AI/ML and Security venues (2020–2025), confirming that 1.07% of papers contain invalid or fabricated citations (604 papers), with an 80.9% increase in 2025 alone. Furthermore, we survey 97 researchers and analyze 94 valid responses after removing 3 conflicting samples, revealing a critical "verification gap": 41.5% of researchers copy-paste BibTeX without checking and 44.4% choose no-action responses when encountering suspicious references; meanwhile, 76.7% of reviewers do not thoroughly check references and 80.0% never suspect fake citations. Our findings reveal an accelerating crisis where unreliable AI tools, combined with inadequate human verification by researchers and insufficient peer review scrutiny, enable fabricated citations to contaminate the scientific record. We propose interventions for researchers, venues, and tool developers to protect citation integrity.
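
The headline counts in the abstract can be cross-checked with simple arithmetic. The sketch below is only a consistency check on the numbers quoted above; the variable and function names are illustrative and are not taken from CiteVerifier.

```python
# Figures quoted in the abstract (names here are illustrative only).
total_papers = 56_381      # papers analyzed, 2020-2025
flagged_papers = 604       # papers with invalid or fabricated citations
surveyed = 97              # researchers surveyed
removed = 3                # conflicting samples removed


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


flagged_rate = percent(flagged_papers, total_papers)
valid_responses = surveyed - removed

print(f"Flagged papers: {flagged_rate}% of {total_papers}")
print(f"Valid survey responses: {valid_responses}")
```

Both derived values agree with the abstract: 604 of 56,381 papers rounds to 1.07%, and 97 responses minus 3 removed leaves 94.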
