Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra

TL;DR

The paper tackles the challenge of trustworthy, verifiable answers from LLMs by enabling retrieval-free internal citations tied to training data. It introduces CitePretrainBench and a two-stage training regime (continual pretraining with Active Indexing and subsequent instruction tuning) to index and cite knowledge using persistent document identifiers. Active Indexing, combining Forward and Backward augmentation, substantially improves citation precision and recall across multiple datasets, scales with augmented data, and remains complementary to retrieval-based methods. The work demonstrates that internal citations can be robust to retrieval noise while enhancing explainability, with practical implications for reducing latency and infrastructure dependence in citation-rich AI systems.

Abstract

Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute their answers to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B and Qwen-2.5-3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16× the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
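
To make the difference between Passive Indexing and the two directions of Active Indexing concrete, here is a minimal Python sketch of how the three kinds of training text might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the `Document` class, the function names, the `<source>` tag, the bracketed citation format, the `wiki-0001` style identifiers, and the hand-written question/answer strings are all hypothetical. In the paper the augmented data is synthetic, whereas here the questions and facts are written by hand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Document:
    doc_id: str  # persistent document identifier (format is an assumption)
    text: str


def passive_example(doc: Document) -> str:
    """Passive Indexing: append the identifier to the raw document text."""
    return f"{doc.text} <source>{doc.doc_id}</source>"


def forward_example(doc: Document, question: str, answer: str) -> str:
    """Source-to-fact: the identifier comes first, then a fact from that source."""
    return f"Source: {doc.doc_id}\nQ: {question}\nA: {answer}"


def backward_example(question: str, facts: List[Tuple[str, Document]]) -> str:
    """Fact-to-source: each stated fact is followed by the identifier it came from."""
    cited = " ".join(f"{fact} [{doc.doc_id}]" for fact, doc in facts)
    return f"Q: {question}\nA: {cited}"


# Hypothetical toy corpus with two documents.
doc_a = Document("wiki-0001", "The Eiffel Tower was completed in 1889.")
doc_b = Document("wiki-0002", "The Statue of Liberty was dedicated in 1886.")

passive = passive_example(doc_a)
forward = forward_example(doc_a, "When was the Eiffel Tower completed?", "1889")
backward = backward_example(
    "Which is older, the Eiffel Tower or the Statue of Liberty?",
    [(doc_b.text, doc_b), (doc_a.text, doc_a)],
)

for sample in (passive, forward, backward):
    print(sample, end="\n\n")
```

The forward sample conditions on the identifier before stating the fact, while the backward sample states facts from several documents and cites each one afterwards, which mirrors the source-to-fact and fact-to-source directions described in the abstract.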

Paper Structure

This paper contains 73 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: CitePretrain Framework. We construct a diverse corpus (Wikipedia, arXiv, Common Crawl, and novel documents) for LLMs to index. Each document is indexed via passive indexing (appending a document ID) and active indexing, which includes: (1) Forward augmentation: generating entity-based QA pairs to map IDs to facts; and (2) Backward augmentation: retrieving related documents to synthesize multi-source QA pairs with citations, mapping facts to IDs. The model is continually pre-trained and instruction-tuned, then evaluated on long- and short-form citation QA tasks.
  • Figure 2: Scaling Comparison Between Active Indexing and Passive Indexing on RepliQA.
  • Figure 3: Performance of internal, external, and hybrid citations across retrieval quality (0=sparse retrieval, 1=dense retrieval). Internal only excels under poor retrieval, external only under strong retrieval, while hybrids generally perform best regardless of retrieval quality, with room to improve.