Do Deployment Constraints Make LLMs Hallucinate Citations? An Empirical Study across Four Models and Five Prompting Regimes

Chen Zhao, Yuan Tang, Yitian Qian

TL;DR

We study how deployment-motivated prompting constraints affect citation verifiability in a closed-book setting. The results motivate post-hoc citation verification before LLM outputs enter SE literature reviews or tooling pipelines.

Abstract

LLMs are increasingly used to draft academic text and to support software engineering (SE) evidence synthesis, but they often hallucinate bibliographic references that look legitimate. We study how deployment-motivated prompting constraints affect citation verifiability in a closed-book setting. Using 144 claims (24 in SE&CS) and a deterministic verification pipeline (Crossref + Semantic Scholar), we evaluate two proprietary models (Claude Sonnet, GPT-4o) and two open-weight models (LLaMA 3.1-8B, Qwen 2.5-14B) across five regimes: Baseline, Temporal (publication-year window), Survey-style breadth, Non-Disclosure policy, and their combination. Across 17,443 generated citations, no model exceeds a citation-level existence rate of 0.475; Temporal and Combo conditions produce the steepest drops while outputs remain format-compliant (well-formed bibliographic fields). Unresolved outcomes dominate (36-61%); a 100-citation audit indicates that a substantial fraction of Unresolved cases are fabricated. Results motivate post-hoc citation verification before LLM outputs enter SE literature reviews or tooling pipelines.
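To make the headline metric concrete, the sketch below shows one plausible way to turn per-citation outcome labels into an outcome distribution and a citation-level existence rate. It rests on assumptions not confirmed by the abstract: that each citation receives exactly one of the three labels named in Figure 1 (Existing, Unresolved, Fabricated), and that the existence rate is the share of citations labeled Existing. The function names and the toy labels are hypothetical and are not the paper's data or code.

```python
from collections import Counter

# Outcome labels as named in the Figure 1 caption.
OUTCOMES = ("Existing", "Unresolved", "Fabricated")


def outcome_distribution(labels):
    """Return the proportion of each outcome label (proportions sum to one)."""
    if not labels:
        raise ValueError("need at least one labeled citation")
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    return {name: counts[name] / total for name in OUTCOMES}


def existence_rate(labels):
    """Assumed definition: fraction of citations labeled 'Existing'."""
    return outcome_distribution(labels)["Existing"]


# Toy labels for illustration only; not taken from the study.
toy = ["Existing"] * 4 + ["Unresolved"] * 5 + ["Fabricated"] * 1
dist = outcome_distribution(toy)
print(dist)
print(existence_rate(toy))
```

Under this reading, a large Unresolved share lowers the existence rate without being counted as fabrication, which is why the abstract reports a separate manual audit of Unresolved cases.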

Paper Structure

This paper contains 41 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Citation-level outcome distribution (Existing, Unresolved, Fabricated) for each model under all five conditions, shown as stacked proportions summing to one. "Non-Disc." denotes the non-disclosure condition.
  • Figure 2: Per-claim verification fraction by model and condition (boxes show IQR with median; whiskers follow the 1.5 × IQR convention). "Non-Disc." denotes the non-disclosure condition.
  • Figure 3: Existence rate by domain (24 claims per group, aggregated across conditions).