A Reality Check on Context Utilisation for Retrieval-Augmented Generation

Lovisa Hagström, Sara Vera Marjanović, Haeun Yu, Arnav Arora, Christina Lioma, Maria Maistro, Pepa Atanasova, Isabelle Augenstein

TL;DR

This paper investigates how language models utilize retrieved context in retrieval-augmented generation under real-world conditions. It introduces DRUID, a large real-world dataset of claim evidence with human-annotated relevance and stance, and DRUID+ for expanded evidence coverage, alongside a novel ACU metric that measures context utilization through re-scaled probability shifts $ACU = \frac{1}{|T|}\sum_{t\in T} D(t,S_E) \Delta P_M (t|C,E)$ with $T=\{True, None, False\}$. By comparing DRUID to synthetic datasets CounterFact and ConflictQA, the study reveals that synthetic data exaggerate certain context traits and memory-conflict rates, while real-world contexts show weaker single-feature predictability and stronger influence from context sources. The findings underscore the need for real-world aligned context utilization studies to accurately assess and improve RAG performance, and the authors provide datasets and tools to facilitate such analyses.
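The ACU formula above can be illustrated with a minimal Python sketch. This summary does not define $D(t,S_E)$ or the exact re-scaling of $\Delta P_M$, so both are assumptions here: `direction` returns +1 when token $t$ matches the evidence stance $S_E$ and -1 otherwise, and `rescaled_shift` divides the raw probability change by the room the probability had to move in that direction. The function names and the example probabilities are hypothetical and are not taken from the authors' code.

```python
from typing import Dict

# Answer tokens T from the ACU definition.
TOKENS = ("True", "None", "False")


def rescaled_shift(p_without: float, p_with: float) -> float:
    """Assumed form of Delta P_M(t | C, E): the change in probability of a
    token when evidence E is added to claim C, divided by the maximum
    possible change in that direction."""
    diff = p_with - p_without
    # An increase can be at most 1 - p_without; a decrease at most p_without.
    room = (1.0 - p_without) if diff >= 0 else p_without
    return diff / room if room > 0 else 0.0


def direction(token: str, stance: str) -> float:
    """Assumed form of D(t, S_E): +1 if the token matches the annotated
    evidence stance, -1 otherwise."""
    return 1.0 if token == stance else -1.0


def acu(p_without: Dict[str, float], p_with: Dict[str, float], stance: str) -> float:
    """ACU = (1/|T|) * sum over t in T of D(t, S_E) * Delta P_M(t | C, E)."""
    total = sum(
        direction(t, stance) * rescaled_shift(p_without[t], p_with[t])
        for t in TOKENS
    )
    return total / len(TOKENS)


# Hypothetical model probabilities for one claim, without and with evidence.
p_claim_only = {"True": 0.5, "None": 0.25, "False": 0.25}
p_with_evidence = {"True": 0.75, "None": 0.125, "False": 0.125}

score_following = acu(p_claim_only, p_with_evidence, stance="True")
score_repelled = acu(p_claim_only, p_with_evidence, stance="False")
```

Under these assumptions, a model that shifts probability towards the annotated stance receives a positive score, while the same shift scored against the opposite stance is negative, which corresponds to the 'context-repulsion' reading of negative ACU values.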

Abstract

Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real-world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context have been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved context. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.

Paper Structure

This paper contains 62 sections, 3 equations, 18 figures, 22 tables.

Figures (18)

  • Figure 1: Datasets for context usage investigations.
  • Figure 2: Average values for the context characteristics in the CounterFact (yu-etal-2023-characterizing), ConflictQA (xie2024knowledgeconflict) and DRUID datasets. The characteristics and their detection are described in the context-characteristics and characteristics-detection sections, respectively.
  • Figure 3: ACU for each model and dataset. The error bars indicate the standard deviation. Negative ACU values indicate 'context-repulsion': changes in probability away from the annotated evidence stance. The dashed horizontal lines indicate average ACU scores for each model and dataset.
  • Figure 4: Spearman correlations between context usage measured by ACU and different context characteristics for Llama. Significant correlation values (p-value < 0.05) are marked in bold.
  • Figure 5: The results of pruning attention heads in Pythia for the original sentence completion task and for when the task has been recast to a claim verification task.
  • ...and 13 more figures