A Theory for Token-Level Harmonization in Retrieval-Augmented Generation

Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng

TL;DR

This paper tackles how retrieval-augmented generation (RAG) can both aid and mislead large language models (LLMs) at the token level. It introduces a theory that treats next-token prediction as a fusion of the LLM's knowledge distribution $p(\cdot)$ and the retrieved-text distribution $p_R(\cdot)$, decomposing the fusion into distribution completion (benefit) and distribution contradiction (detriment) and showing that the net effect is governed by the difference between the two. The authors prove that the actual effect of RAG on a token can be predicted from representation similarity, without access to retrieval utility and without retraining, which enables token-level explainability. Based on this theory, Tok-RAG runs the pure LLM and RAG in parallel and generates collaboratively, selecting each token by comparing the relative magnitudes of benefit and detriment via a similarity criterion. Experiments on OPT, LLaMA-2, and Mistral demonstrate improved performance and robustness without extra training or utility evaluators, highlighting the practical impact of a principled, theory-driven approach to RAG.

Abstract

Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance large language models (LLMs). Studies show that while RAG provides valuable external information (benefit), it may also mislead LLMs (detriment) with noisy or incorrect retrieved texts. Although many existing methods attempt to preserve benefit and avoid detriment, they lack a theoretical explanation for RAG. The benefit and detriment in the next-token prediction of RAG remain a black box that cannot be quantified or compared in an explainable manner, so existing methods are data-driven, require additional utility evaluators, or are post-hoc. This paper takes the first step towards providing a theory to explain and trade off the benefit and detriment in RAG. First, we model RAG as the fusion between the distribution of the LLM's knowledge and the distribution of the retrieved texts. Then, we formalize the trade-off between the value of external knowledge (benefit) and its potential risk of misleading LLMs (detriment) in the next-token prediction of RAG by the distribution difference in this fusion. Finally, we prove that the actual effect of RAG on a token, which is the comparison between benefit and detriment, can be predicted without any training or access to the utility of retrieval. Based on our theory, we propose a practical novel method, Tok-RAG, which achieves collaborative generation between the pure LLM and RAG at the token level to preserve benefit and avoid detriment. Experiments on real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the effectiveness of our method and support our theoretical findings.

Paper Structure (29 sections, 6 theorems, 72 equations, 5 figures, 6 tables)

Key Result

Corollary 1

Two terms about distribution difference in Equation eq4 measure the benefit and detriment respectively. The subtraction between benefit and detriment describes the trade-off relationship between the value of external knowledge and its potential risk of misleading LLM in the next token prediction wit

Figures (5)

  • Figure 1: Framework of our Tok-RAG. It performs collaborative generation between the pure LLM and RAG at the token level by comparing benefit and detriment based on our theoretical findings about distribution difference. The tokens selected at each step are used as the prefix for both the pure LLM and RAG. Tok-RAG preserves benefit and avoids detriment without any training or utility evaluators.
  • Figure 2: Derivation path of our theory. Reference: Equation eq4, Theorems 1, 2, and 3, and Corollaries 1, 2, and 3.
  • Figure 3: Attention score for $x_i$ (blue line) and difference of word distribution change (yellow line) across layers. Stage 1: Lexical and Syntactic. Stage 2: Text Matching. Stage 3: Distribution Fusion.
  • Figure 4: AUC varies with layer.
  • Figure 5: Case study for collaborative generation between the pure LLM and RAG at the token level in our Tok-RAG. The pure LLM and RAG generate text in parallel, one token at a time. At each step where the pure LLM and RAG generate different tokens, Tok-RAG uses our theoretical results in Theorem 3 to compare the benefit and detriment. If the benefit is greater than the detriment, the token from RAG is selected; otherwise, the token from the pure LLM is selected. The selected tokens are marked in green and bold. The discarded tokens are marked in gray. The orange arrow represents the direction of token selection and usage. The selected tokens are used for the next generation step of both the pure LLM and RAG.
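
The selection loop described in the captions of Figures 1 and 5 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `collaborative_generate`, `llm_next`, `rag_next`, and `benefit_exceeds_detriment` are hypothetical names, the two decoders are modeled as plain functions from a shared prefix to the next token, and the benefit/detriment test (which the paper derives in Theorem 3 from representation similarity) is left as a caller-supplied placeholder.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for this sketch (not from the paper).
NextToken = Callable[[Sequence[str]], str]
Criterion = Callable[[Sequence[str], str, str], bool]


def collaborative_generate(
    llm_next: NextToken,
    rag_next: NextToken,
    benefit_exceeds_detriment: Criterion,
    max_steps: int,
    eos: str = "</s>",
) -> List[str]:
    """Token-level collaboration between a pure LLM and a RAG decoder.

    Both decoders always continue from the same prefix of selected tokens.
    When they disagree, the criterion decides whose token is kept.
    """
    prefix: List[str] = []
    for _ in range(max_steps):
        llm_tok = llm_next(prefix)
        rag_tok = rag_next(prefix)
        if llm_tok == rag_tok:
            chosen = llm_tok  # no conflict, nothing to compare
        elif benefit_exceeds_detriment(prefix, llm_tok, rag_tok):
            chosen = rag_tok  # benefit > detriment: keep the RAG token
        else:
            chosen = llm_tok  # otherwise keep the pure-LLM token
        if chosen == eos:
            break
        prefix.append(chosen)  # shared prefix for the next step
    return prefix


# Toy stand-ins: scripted decoders that ignore the prefix content.
def _scripted(script: List[str]) -> NextToken:
    return lambda prefix: script[len(prefix)] if len(prefix) < len(script) else "</s>"


toy_llm = _scripted(["Paris", "is", "big", "</s>"])
toy_rag = _scripted(["Paris", "was", "big", "</s>"])


def toy_criterion(prefix: Sequence[str], llm_tok: str, rag_tok: str) -> bool:
    # Placeholder only: pretend benefit exceeds detriment at step 1.
    return len(prefix) == 1


result = collaborative_generate(toy_llm, toy_rag, toy_criterion, max_steps=10)
print(result)
```

In this toy run the two decoders disagree only at the second step, so the placeholder criterion is consulted exactly once. Swapping in a criterion that always returns `False` reduces the loop to pure-LLM decoding.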

Theorems & Definitions (12)

  • Corollary 1
  • Theorem 1
  • Theorem 2
  • Corollary 2
  • Theorem 3
  • Corollary 3
  • proof
  • proof
  • proof
  • proof
  • ...and 2 more