Context Quality Matters in Training Fusion-in-Decoder for Extractive Open-Domain Question Answering

Kosuke Akimoto, Kunihiro Takeoka, Masafumi Oyamada

TL;DR

This work investigates how the quality and quantity of context used to train Fusion-in-Decoder (FiD) affect extractive open-domain QA. It shows that FiD tends to overfit to the context quality seen during training: models trained on lower-quality context focus cross-attention more selectively on relevant passages, while models trained on higher-quality context attend more uniformly across passages. It also reports that training on mixed context quality can exacerbate or mitigate this effect, depending on the evaluation setting. The authors support the link between attention patterns and performance with interventions on cross-attention probability, and they propose an inference-time adaptation that biases the cross-attention distribution through a temperature parameter, improving robustness to unseen context quality. The findings have practical implications for designing training data and inference-time controls in retrieval-augmented QA systems, since the adaptation improves generalization across context qualities without retraining.

Abstract

Retrieval-augmented generation models augment knowledge encoded in a language model by providing additional relevant external knowledge (context) during generation. Although it has been shown that the quantity and quality of context impact the performance of retrieval-augmented generation models during inference, limited research explores how these characteristics affect model training. This paper explores how context quantity and quality during model training affect the performance of Fusion-in-Decoder (FiD), the state-of-the-art retrieval-augmented generation model, in extractive open-domain question answering tasks. Experimental results suggest that FiD models overfit to context quality during training and show suboptimal performance when evaluated on different context quality. Through the experimental results, we also reveal FiD models trained with different context quality have different cross-attention distribution patterns. Specifically, as context quality during training increases, FiD models tend to attend more uniformly to each passage in context. Finally, based on these observations, we propose a method to mitigate overfitting to specific context quality by introducing bias to the cross-attention distribution, which we demonstrate to be effective in improving the performance of FiD models on different context quality.
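
To make the idea of biasing a cross-attention distribution concrete, the following is a minimal sketch of temperature scaling applied to per-passage attention scores. It is a generic illustration, not the paper's formulation: the function names `passage_attention` and `entropy`, the example scores, and the choice to divide scores by a temperature before the softmax are all assumptions made here. A temperature above 1 pushes the distribution toward the uniform value 1/N over N passages, and a temperature below 1 concentrates it on the highest-scoring passages.

```python
import math


def passage_attention(scores, temperature=1.0):
    """Softmax over per-passage scores, scaled by a temperature.

    Hypothetical helper for illustration only; the paper's exact
    adaptation rule is not reproduced here.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means closer to uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical scores for N = 4 passages (first one is most relevant).
scores = [2.0, 0.5, 0.1, -0.3]

sharp = passage_attention(scores, temperature=0.5)
base = passage_attention(scores, temperature=1.0)
smooth = passage_attention(scores, temperature=4.0)

for name, dist in [("T=0.5", sharp), ("T=1.0", base), ("T=4.0", smooth)]:
    print(name, [round(p, 3) for p in dist], "entropy:", round(entropy(dist), 3))
```

In this sketch the temperature is a single scalar chosen per evaluation setting; how that value would be selected in practice is left open.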

Paper Structure (25 sections, 3 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: Performance of FiD models on Natural Questions with varying training context quality. Panels represent different evaluation environments with different $(n_\text{eval}^+,n_\text{eval})$ pairs, and a red dashed line shows the context quality of the corresponding evaluation environment. Red stars mark the best-performing models in the corresponding evaluation environments. Dotted lines show models trained on the same context quantity $n_\text{train}$.
  • Figure 2: Performance of FiD models on TriviaQA with varying training context quantity. Panels represent different evaluation environments with different $(n^+_\text{eval},k_\text{eval})$ pairs, and a red dashed line shows the context quantity of the corresponding evaluation environment. Dotted lines show models trained on the same context quality $\frac{1}{1+k_\text{train}}$.
  • Figure 3: Distribution of cross-attention probability to each relevant or irrelevant passage at Layer 9. A similar trend can be seen in other higher layers. Red vertical dashed lines represent uniform cross-attention probability, i.e., $\frac{1}{N}$ if context quantity is $N$.
  • Figure 4: Model performance under intervention on cross-attention probability. "No" represents a setting without intervention.
  • Figure 5: Top panels: Performance of FiD models on Natural Questions with adaptation by the proposed method (solid lines) and without adaptation (dotted lines). Bottom panels: Optimal temperature parameter $T^*$ selected for each model. Multiple $T^*$ were selected for some context qualities, i.e., training environments, because we selected $T^*$ for each of the three models trained with different random seeds for each training environment. Panels represent different evaluation environments with different $(n_\text{eval}^+, n_\text{eval})$ pairs, and a red dashed line shows the context quality of the corresponding evaluation environment.
  • ...and 5 more figures