ContextCite: Attributing Model Generation to Context

Benjamin Cohen-Wang; Harshay Shah; Kristian Georgiev; Aleksander Madry

TL;DR

This paper introduces the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. It then presents ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model.

Abstract

How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements, (2) improving response quality by pruning the context, and (3) detecting poisoning attacks. We provide code for ContextCite at https://github.com/MadryLab/context-cite.

Paper Structure (39 sections, 5 equations, 6 figures, 1 algorithm)

Introduction
Our contributions
Formalizing context attribution (see the problem statement section).
Performing context attribution with ContextCite (see the method and evaluation sections).
Applying context attribution (see the applications section).
Problem statement
Setup.
Context attribution
What do context attribution scores signify?
Evaluating the quality of context attributions
Attributing selected statements from the response
Context attribution with ContextCite
Evaluating ContextCite
Datasets.
Models.
...and 24 more sections

Figures (6)

Figure 1: ContextCite. Our context attribution method, ContextCite, traces any specified generated statement back to the parts of the context that are responsible for it.
Figure 2: An example of the linear surrogate model used by ContextCite. On the left, we consider a context, query, and response generated by Llama-3-8B (Dubey et al., 2024) about weather in Antarctica. In the middle, we list the weights of a linear surrogate model that estimates the logit-scaled probability of the response as a function of the context ablation vector (see the model output equation); ContextCite casts these weights as attribution scores. On the right, we plot the surrogate model's predictions against the actual logit-scaled probabilities for random context ablations. Two sources appear to be primarily responsible for the response, resulting in four "clusters" corresponding to whether each of these sources is included or excluded. These sources appear to interact linearly: the effect of removing both sources is close to the sum of the effects of removing each source individually. As a result, the linear surrogate model faithfully captures the language model's behavior.
Figure 3: Inducing sparsity improves the surrogate model's sample efficiency. In CNN DailyMail (Nallapati et al., 2016), a summarization task, and Natural Questions (Kwiatkowski et al., 2019), a question answering task, we observe that the number of sources that are "relevant" for a particular statement generated by Llama-3-8B (Dubey et al., 2024) is small, even when the context comprises many sources (ground-truth sparsity panel). Therefore, inducing sparsity via Lasso yields an accurate surrogate model with just a few ablations (Lasso versus OLS comparison panel). See the sparsity details section for the exact setup.
Figure 4: Evaluating context attributions. We report the top-$k$ log-probability drop and linear datamodeling score of ContextCite and baselines. We evaluate attributions of responses generated by Llama-3-8B and Phi-3-mini on up to $1,000$ randomly sampled validation examples from each of three benchmarks. We find that ContextCite using just $32$ context ablations consistently matches or outperforms the baselines (attention, gradient norm, semantic similarity, and leave-one-out) across benchmarks and models. Increasing the number of context ablations to $\{64,128,256\}$ can further improve the quality of ContextCite attributions.
Figure 5: Helping verify generated statements using ContextCite. We report the AUC of Llama-3-8B for verifying the correctness of its own answers when we provide it with the top-$k$ sources identified by ContextCite and when we provide it with the entire context. We consider $1,000$ random examples from HotpotQA on the left and $1,000$ random examples from Natural Questions on the right. In both cases, using the top-$k$ sources results in substantially more effective verification than using the entire context, suggesting that ContextCite can help language models verify their own statements.
...and 1 more figure
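
The captions for Figures 2 and 3 describe fitting a sparse linear surrogate to the logit-scaled probability of the response under random context ablations, and reading its weights as attribution scores. The following is a minimal sketch of that idea, not the authors' implementation. It makes several assumptions: `toy_response_probability` is a hypothetical stand-in for a real language model, each source is kept independently with probability 1/2, and the Lasso is solved with a small hand-written coordinate descent so that the example depends only on NumPy. All function names and the regularization strength `alpha` are illustrative.

```python
import numpy as np


def logit(p):
    """Logit-scaled probability: log(p / (1 - p))."""
    return np.log(p) - np.log1p(-p)


def toy_response_probability(mask):
    """Hypothetical stand-in for a language model (an assumption).

    Returns the probability of a fixed response given which sources are
    kept (mask[i] == 1) or ablated (mask[i] == 0). Only sources 2 and 5
    matter, and their effects add up in logit space.
    """
    z = -4.0 + 3.0 * mask[2] + 2.0 * mask[5]
    return 1.0 / (1.0 + np.exp(-z))


def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/(2n)) * ||y - b - X w||^2 + alpha * ||w||_1."""
    n, d = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean        # centering handles the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n     # per-feature curvature
    w = np.zeros(d)
    residual = yc.copy()                   # equals yc - Xc @ w throughout
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue                   # constant column carries no signal
            rho = Xc[:, j] @ residual / n + col_sq[j] * w[j]
            new_wj = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            residual -= Xc[:, j] * (new_wj - w[j])
            w[j] = new_wj
    b = y_mean - x_mean @ w
    return w, b


def attribute(prob_fn, num_sources, num_ablations=32, alpha=0.01, seed=0):
    """Fit a sparse linear surrogate on random ablations; weights are scores."""
    rng = np.random.default_rng(seed)
    # Each row is an ablation vector: 1 keeps a source, 0 removes it.
    masks = rng.integers(0, 2, size=(num_ablations, num_sources)).astype(float)
    targets = np.array([logit(prob_fn(m)) for m in masks])
    return lasso_coordinate_descent(masks, targets, alpha)


scores, intercept = attribute(toy_response_probability, num_sources=8)
top_two = set(np.argsort(-scores)[:2].tolist())
print("attribution scores:", np.round(scores, 2))
print("top two sources:", sorted(top_two))
```

In this toy setting the two influential sources interact linearly in logit space, which mirrors the situation described for Figure 2, so a linear surrogate can recover them from 32 ablations. With a real model, each call to the probability function would be one forward pass on an ablated context.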

Theorems & Definitions (3)

Definition 2.1: Context attribution
Definition 2.2: Top-$k$ log-probability drop
Definition 2.3: Linear datamodeling score
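
This page lists the two evaluation metrics by name only. The sketch below shows one plausible reading of them, and both readings are assumptions rather than the paper's exact definitions: the top-$k$ log-probability drop is taken to be the decrease in the response's log-probability when the $k$ highest-scoring sources are removed, and the linear datamodeling score is taken to be the Spearman rank correlation between the summed scores of the kept sources and the model's actual output over random ablations. `toy_log_prob` is a hypothetical stand-in for a language model, and all function names are illustrative.

```python
import numpy as np


def toy_log_prob(mask):
    """Hypothetical model output (an assumption): log-probability of a fixed
    response given which sources are kept (1) or ablated (0)."""
    z = -4.0 + 3.0 * mask[2] + 2.0 * mask[5]
    return -np.log1p(np.exp(-z))           # log of sigmoid(z)


def top_k_log_prob_drop(log_prob_fn, scores, k):
    """Log-probability with the full context minus the log-probability
    after removing the k highest-scoring sources."""
    scores = np.asarray(scores, dtype=float)
    full = np.ones(len(scores))
    ablated = full.copy()
    ablated[np.argsort(-scores)[:k]] = 0.0
    return float(log_prob_fn(full) - log_prob_fn(ablated))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def linear_datamodeling_score(output_fn, scores, masks):
    """Spearman correlation between predicted effects (sum of the scores of
    the kept sources) and actual model outputs over the given ablations."""
    predicted = masks @ np.asarray(scores, dtype=float)
    actual = np.array([output_fn(m) for m in masks])
    ranked = np.corrcoef(average_ranks(predicted), average_ranks(actual))
    return float(ranked[0, 1])


# Illustrative attribution scores for 8 sources (not results from the paper).
example_scores = np.array([0.0, 0.0, 3.0, 0.0, 0.0, 2.0, 0.0, 0.0])
rng = np.random.default_rng(0)
random_masks = rng.integers(0, 2, size=(64, 8)).astype(float)

drop_1 = top_k_log_prob_drop(toy_log_prob, example_scores, k=1)
drop_2 = top_k_log_prob_drop(toy_log_prob, example_scores, k=2)
lds = linear_datamodeling_score(toy_log_prob, example_scores, random_masks)
print("top-1 drop:", round(drop_1, 3), "top-2 drop:", round(drop_2, 3))
print("LDS:", round(lds, 3))
```

Because Spearman correlation depends only on ranks, it does not matter in this sketch whether the model output is a log-probability or a logit-scaled probability. Under both readings, higher values indicate attributions that better reflect the model's behavior.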
