Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Delip Rao, Eric Wong, Chris Callison-Burch

Abstract

Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3--13% of citation URLs are hallucinated -- they have no record in the Wayback Machine and likely never existed -- while 5--18% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4% (Business) to 11.4% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by 6--79× to under 1%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.
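
The abstract describes urlhealth as combining a liveness check with a Wayback Machine lookup to separate stale links from hallucinated ones. The sketch below illustrates that decision rule using the public Wayback Machine availability endpoint (archive.org/wayback/available); it is a minimal illustration, not the actual urlhealth implementation, and the function name, HEAD-then-GET fallback, and status-code threshold are our assumptions.

```python
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def classify_url(url: str, timeout: float = 10.0) -> str:
    """Classify a citation URL as 'live', 'stale', or 'hallucinated'.

    live         -- resolves right now
    stale        -- dead now, but archived (the page once existed: link rot)
    hallucinated -- dead now and never archived (likely fabricated)
    """
    # Step 1: liveness. Treat any response below 400 as resolving.
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code == 405:  # some servers reject HEAD; retry with GET
            resp = requests.get(url, allow_redirects=True,
                                timeout=timeout, stream=True)
        if resp.status_code < 400:
            return "live"
    except requests.RequestException:
        pass  # DNS failure, refused connection, timeout: not resolving

    # Step 2: dead URL -- ask the Wayback Machine whether it was ever archived.
    wb = requests.get(WAYBACK_API, params={"url": url}, timeout=timeout)
    snapshot = wb.json().get("archived_snapshots", {}).get("closest")
    return "stale" if snapshot else "hallucinated"
```

Under this rule, a dead-but-archived URL counts as stale (ordinary link rot), while a URL with no snapshot at all is the "likely never existed" case the abstract labels hallucinated.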

Paper Structure

This paper contains 67 sections, 1 equation, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Non-resolving URL rates for DRBench models, grouped by provider. Each bar decomposes into hallucinated URLs (red; no Wayback Machine archive, indicating the URL likely never existed) and stale URLs (orange; archived but currently dead). Numbers above bars indicate total URLs per model.
  • Figure 2: Non-resolving URL rates by academic field and model for ExpertQA. Fields are sorted by overall non-resolving URL rate.
  • Figure 3: Non-resolving URL rates by subfield within Healthcare/Medicine. The top 15 subfields (sorted by rate) range from 14.8% (Virologist) to 21.4% (General practitioner). Subfield labels A--O are keyed in the accompanying table.
  • Figure 4: Distribution of urlhealth correction rounds per question (435 questions each, 3 models). The three models exhibit distinct self-correction profiles. Gemini 2.5 Pro (green) completes in 1--2 rounds every time: its two-phase architecture (Google Search grounding followed by a single verification turn) caps it at two rounds, and 44% of questions need only one. GPT-5.1 (orange) clusters at 2 rounds (61%), with 87% finishing in 2--3 rounds; the long tail to 6 rounds is rare (<2% of questions). Claude Sonnet 4.5 (blue) concentrates at 3--5 rounds (83% of questions, median 4), reflecting the cost of verifying its large citation sets (18.4 URLs/question vs. 9.7--11.1 for the other models). All three models achieve near-zero hallucinated citations in final output despite these different convergence patterns. A sketch of one possible round structure follows this list.
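
Figure 4 counts rounds of a verify-and-regenerate loop, but this excerpt does not show the loop itself. The following is a minimal sketch of one plausible round structure: the `generate` callback stands in for the model call, `check_url` can be the classifier sketched after the abstract, and the prompt wording and round cap are our assumptions rather than the paper's actual agent setup.

```python
from typing import Callable, List, Tuple

def self_correct(
    question: str,
    generate: Callable[[str], Tuple[str, List[str]]],  # hypothetical model call: prompt -> (answer, cited URLs)
    check_url: Callable[[str], str],                   # e.g. classify_url from the earlier sketch
    max_rounds: int = 6,                               # longest tail observed in Figure 4
) -> Tuple[str, int]:
    """Regenerate an answer until every citation URL resolves, or give up.

    One 'round' = one generate call plus a URL health check, mirroring
    the correction rounds counted in Figure 4.
    """
    prompt = question
    answer: str = ""
    for round_no in range(1, max_rounds + 1):
        answer, urls = generate(prompt)
        bad = [u for u in urls if check_url(u) != "live"]
        if not bad:
            return answer, round_no  # converged: all citations resolve
        # Feed the failing URLs back so the model can replace or drop them.
        prompt = (question + "\n\nThese citation URLs did not resolve; "
                  "replace or remove them:\n" + "\n".join(bad))
    return answer, max_rounds
```

In this framing, the per-model profiles in Figure 4 correspond to how quickly `bad` empties out: a model that verifies eagerly converges in 1--2 rounds, while one carrying many citations per question pays for extra rounds of checking and regeneration.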