
Unveiling the Magic: Investigating Attention Distillation in Retrieval-augmented Generation

Zizhong Li, Haopeng Zhang, Jiawei Zhang

TL;DR

This study investigates attention distillation in retrieval-augmented generation, aiming to understand when attention-based supervision improves retriever-reader training. It introduces a decoder-only attention framework, defines the attention distribution $p_{ATTN}$ and the retriever distribution $p_{RETR}$, and aligns the two by minimizing the KL divergence $D_{KL}$ between them, which reveals how attention signals guide retrieval. Through extensive QA experiments on NaturalQuestions and TriviaQA, the authors show that a high-quality reader is essential for effective distillation: Fine-tuned Distillation Training (Step2) delivers the best QA and retrieval metrics. They also identify two practical indicators of distillation quality, one based on answer-related tokens and one on question-related tokens. The work proposes actionable training priorities and discusses extensions to encoder-decoder architectures. It acknowledges limitations tied to model scale and to the applicability of perplexity distillation, and suggests validation on larger LMs as future work.
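The objective described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the reader's attention has already been aggregated into one score per retrieved document, that both distributions are obtained with a softmax, and that the attention distribution is the target (first argument) of the KL divergence. The function names and the numbers are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores, temperature=1.0):
    """Normalize raw scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same documents."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical values for one question with four retrieved documents.
attention_per_doc = [2.1, 0.3, 0.9, 0.1]  # aggregated reader attention (assumed)
retriever_scores = [1.2, 1.0, 0.4, 0.8]   # query-document similarity (assumed)

p_attn = softmax(attention_per_doc)  # stands in for p_ATTN
p_retr = softmax(retriever_scores)   # stands in for p_RETR

# The distillation loss is zero only when the retriever ranks documents
# with the same relative weights as the reader's attention.
loss = kl_divergence(p_attn, p_retr)
print(f"D_KL(p_attn || p_retr) = {loss:.4f}")
```

In an actual training loop the gradient of this loss would flow only into the retriever, with the attention-derived target treated as a constant. That detail is also an assumption of this sketch.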

Abstract

The retrieval-augmented generation framework can address the limitations of large language models by enabling real-time knowledge updates for more accurate answers. An efficient approach to training retrieval-augmented models is attention distillation, which uses attention scores as a supervision signal instead of manually annotated query-document pairs. Despite its growing popularity, the detailed mechanisms behind the success of attention distillation remain unexplored, particularly the specific patterns it leverages to benefit training. In this paper, we address this gap by conducting a comprehensive review of the attention distillation workflow and identifying key factors that influence the learning quality of retrieval-augmented language models. We further propose indicators for optimizing models' training methods and avoiding ineffective training.

Paper Structure (12 sections, 2 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Training Contriever on NaturalQuestions for the QA task with attention distillation shows an improved Hit Rate @ 5 with a fine-tuned reader but a significant decline with an off-the-shelf reader.
  • Figure 2: The framework of the retrieval-augmented language model used in our experiments.
  • Figure 3: The attention score distribution histogram (left) and the Spearman correlation distribution histogram (right) of the $95^{th}$-percentile answer-related tokens on the NQ dataset.
  • Figure 4: Model performance (top) and attention distillation analysis (bottom) of the Atlas-large model (yellow) for the answer-related tokens, compared with Fine-tuned Distillation Training (Step2) (blue).
  • Figure 5: The attention score distribution histogram (left) and the Spearman correlation distribution histogram (right) of the $95^{th}$-percentile answer-related tokens on the NQ dataset.
  • ...and 3 more figures