Source-Aware Training Enables Knowledge Attribution in Language Models

Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, Hao Peng

TL;DR

Intrinsic source citation aims to attribute LLM knowledge to pretraining sources. The authors propose source-aware training, injecting document IDs during pretraining and instruction tuning to enable citation of supporting sources, with minimal architectural changes. They validate on a synthetic BioCite dataset and show attribution is feasible with modest impact on perplexity, though data augmentation is important for generalization. The work provides practical guidance for building verifiable and transparent language models by tying parametric knowledge to traceable sources.

Abstract

Large language models (LLMs) learn a vast amount of knowledge during pretraining, but they are often oblivious to the source(s) of such knowledge. We investigate the problem of intrinsic source citation, where LLMs are required to cite the pretraining source supporting a generated response. Intrinsic source citation can enhance LLM transparency, interpretability, and verifiability. To give LLMs such ability, we explore source-aware training -- a recipe that involves (i) training the LLM to associate unique source document identifiers with the knowledge in each document, followed by (ii) an instruction-tuning stage to teach the LLM to cite a supporting pretraining source when prompted. Source-aware training borrows from existing pretraining/fine-tuning frameworks and requires minimal changes to the model architecture or implementation. Through experiments on synthetic data, we demonstrate that our training recipe can enable faithful attribution to the pretraining data without a substantial impact on the model's perplexity compared to standard pretraining. Our findings also highlight the importance of pretraining data augmentation in achieving attribution. Code and data available here: https://github.com/mukhal/intrinsic-source-citation
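
The two-stage recipe in the abstract can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the `<id>` delimiters, the function names, and the prompt layout are hypothetical, and the behaviour attached to the strategy names `doc-begin`, `doc-end`, and `repeat` (taken from the figure captions) is an assumption inferred from the names alone.

```python
import re

# Hypothetical ID delimiters; the source does not specify a format.
ID_OPEN, ID_CLOSE = "<id>", "</id>"


def inject_doc_id(text: str, doc_id: str, strategy: str = "doc-end") -> str:
    """Stage (i): attach a document identifier to pretraining text.

    The strategy semantics below are assumptions, not the paper's definitions.
    """
    tag = f"{ID_OPEN}{doc_id}{ID_CLOSE}"
    if strategy == "doc-begin":
        # Assumed: the ID precedes the document.
        return f"{tag} {text}"
    if strategy == "doc-end":
        # Assumed: the ID follows the document.
        return f"{text} {tag}"
    if strategy == "repeat":
        # Assumed: the ID is repeated after every sentence.
        # Naive sentence split on ., !, or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        return " ".join(f"{s} {tag}" for s in sentences)
    raise ValueError(f"unknown strategy: {strategy!r}")


def make_citation_example(question: str, answer: str, doc_id: str) -> dict:
    """Stage (ii): build one instruction-tuning pair that asks for a source.

    The prompt layout is hypothetical.
    """
    return {
        "prompt": f"Q: {question}\nA: {answer}\nSource:",
        "completion": f" {ID_OPEN}{doc_id}{ID_CLOSE}",
    }


if __name__ == "__main__":
    # Made-up document and ID, for illustration only.
    doc = "Ada was born in 1815. She wrote the first program."
    for name in ("doc-begin", "doc-end", "repeat"):
        print(name, "->", inject_doc_id(doc, "doc42", name))
    print(make_citation_example("When was Ada born?", "1815", "doc42"))
```

Keeping the identifier as ordinary text in the training sequence is what lets a recipe like this reuse a standard language-modelling objective, consistent with the abstract's claim that only minimal changes to the architecture or implementation are needed.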

Paper Structure (34 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Intrinsic source citation: The language model cites pretraining document(s) from which it acquired its relevant parametric knowledge.
  • Figure 2: Training and evaluation setup: The pretraining corpus is split into in-domain and out-of-domain documents. The in-domain documents are used to create instruction tuning examples, and the out-of-domain documents are used for attribution evaluation.
  • Figure 3: Example of chain-of-thought attribution. During both training and inference, the model cites the remaining part of the document before generating the doc ID.
  • Figure 4: Left: Answer EM over questions from in-domain and OOD documents after 1 fine-tuning epoch with different ID injection strategies (described in the pretraining section). The LLM can generalize well to out-of-domain (OOD) questions in all document ID locations, although both in-domain and OOD answer EM scores degrade with doc-begin. Right: Hits@1 over in-domain and OOD questions during instruction tuning. Only repeat and doc-end + CoT can achieve OOD attribution.
  • Figure 5: Left: LLM quality vs. OOD attribution. Higher is better for Hits@1 and Log Likelihood. Optimal is top-right corner. doc-end + CoT is Pareto-optimal as it strikes the best balance between LLM quality and OOD attribution. Right: OOD attribution performance with different gold document lengths.
  • ...and 1 more figure
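
The captions above report two metrics, answer EM and Hits@1. A minimal sketch of how such metrics are commonly computed follows. The normalization used for exact match and the function names are assumptions; this excerpt does not give the paper's exact definitions.

```python
from typing import Iterable, Sequence, Set


def exact_match(prediction: str, gold: str) -> bool:
    """Answer EM under an assumed normalization: lowercase, collapse whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(prediction) == normalize(gold)


def hits_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 1) -> float:
    """Return 1.0 if any of the top-k predicted document IDs is a gold source."""
    return float(any(doc_id in gold_ids for doc_id in ranked_ids[:k]))


def mean_hits_at_k(
    all_ranked: Iterable[Sequence[str]],
    all_gold: Iterable[Set[str]],
    k: int = 1,
) -> float:
    """Average Hits@k over a set of questions."""
    scores = [hits_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    print(exact_match("  Paris ", "paris"))
    print(mean_hits_at_k([["d1"], ["d3"]], [{"d1"}, {"d2"}], k=1))
```

With k = 1 the metric only credits the model when its first cited document is a gold source, which is why it is a strict measure of attribution on out-of-domain questions.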