Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

Peilin Wu, Xinlu Zhang, Wenhao Yu, Xingyu Liu, Xinya Du, Zhiyu Zoey Chen

TL;DR

The paper argues that Retrieval-Augmented Language Models (RALMs) must adapt to diverse user needs and retrieval conditions. It introduces a user-centric evaluation framework that combines three user need cases with three context settings, and validates it with experiments on HotpotQA, DisentQA, and URAQ using two model families. Key findings show that restricting memory usage can boost robustness under adversarial retrieval but reduces peak performance under ideal retrieval, and that model-family and scale effects shape behavior more than instruction type alone. The work highlights the necessity of user-centric benchmarking for real-world RALMs and provides insights into optimizing performance across varied retrieval contexts. The authors plan to release URAQ to support future research.

Abstract

Recent advancements in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing evaluation benchmarks often assume a single optimal approach to leveraging retrieved information, failing to account for varying user needs. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases (Context-Exclusive, Context-First, and Memory-First) across three distinct context settings: Context Matching, Knowledge Conflict, and Information Irrelevant. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications where models must adapt to diverse user requirements. Through extensive experiments on multiple QA datasets, including HotpotQA, DisentQA, and our newly constructed synthetic URAQ dataset, we find that restricting memory usage improves robustness in adversarial retrieval conditions but decreases peak performance with ideal retrieval results, and that model family dominates behavioral differences. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems and provide insights into optimizing model performance across varied retrieval contexts. We will release our code and URAQ dataset upon acceptance of the paper.
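
The evaluation design described in the abstract crosses three user need cases with three context settings, giving nine cells. The following is a minimal sketch of how such a grid could be enumerated and how accuracy could be tallied per cell. The function names (`evaluation_grid`, `accuracy_by_cell`) and the record format are illustrative assumptions, not the authors' released code.

```python
from itertools import product

# Names taken from the abstract; everything else here is hypothetical.
USER_NEEDS = ("Context-Exclusive", "Context-First", "Memory-First")
CONTEXT_SETTINGS = ("Context Matching", "Knowledge Conflict", "Information Irrelevant")


def evaluation_grid():
    """Return every (user need, context setting) pair: 3 x 3 = 9 cells."""
    return list(product(USER_NEEDS, CONTEXT_SETTINGS))


def accuracy_by_cell(records):
    """Aggregate accuracy per grid cell.

    `records` is an iterable of (need, setting, is_correct) tuples,
    an assumed format chosen for this sketch.
    Cells with no records map to None.
    """
    totals = {cell: [0, 0] for cell in evaluation_grid()}  # [correct, seen]
    for need, setting, is_correct in records:
        counts = totals[(need, setting)]
        counts[0] += int(is_correct)
        counts[1] += 1
    return {
        cell: (correct / seen if seen else None)
        for cell, (correct, seen) in totals.items()
    }


if __name__ == "__main__":
    for need, setting in evaluation_grid():
        print(f"{need} x {setting}")
```

Reporting results per cell rather than as a single aggregate keeps the trade-off noted in the abstract visible: a model can gain under adversarial retrieval while losing under ideal retrieval.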

Paper Structure

This paper contains 55 sections, 1 equation, 12 figures, 3 tables.

Figures (12)

  • Figure 1: User needs may differ in how retrieved context and internal memory should be used as knowledge sources, and most previous work has focused on only a small portion of them.
  • Figure 2: An illustration of the framework using an example question, its possible retrieved context, and the ground truth answer under each situation. Depending on the user need and context setting, the ground truth answer can differ, reflecting instructed faithfulness (e.g., to 'Proxima Centauri' if dictated by context and user need) rather than absolute factual correctness.
  • Figure 3: Overall user need performance curve of all models on each dataset.
  • Figure 4: Case-Level Accuracy curve of Qwen2.5 and Llama-3.1 on HotpotQA.
  • Figure 5: Accuracy curve of Qwen2.5-72B-Instruct on HotpotQA dataset under all context settings with Context-First and Memory-First.
  • ...and 7 more figures
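
The Figure 2 caption notes that the ground truth answer depends on both the user need and the context setting. A rough sketch of that idea follows. The resolution rules are an assumption inferred from the case names alone (Context-Exclusive uses only the context, Context-First prefers the context and falls back to memory, Memory-First prefers memory and falls back to the context); the paper's actual definitions may differ, and `expected_answer` and `ABSTAIN` are hypothetical names.

```python
ABSTAIN = "unknown"  # hypothetical abstention label, not taken from the paper


def expected_answer(need, setting, context_answer, memory_answer):
    """Pick the answer that would count as faithful to the instruction.

    Assumed rules (inferred from the case names, not confirmed by the source):
      - the context carries an answer unless the setting is "Information Irrelevant"
      - Context-Exclusive: context answer, else abstain
      - Context-First:     context answer, else memory answer
      - Memory-First:      memory answer, else context answer, else abstain
    Pass None for an answer source that has nothing to offer.
    """
    context_usable = setting != "Information Irrelevant" and context_answer is not None

    if need == "Context-Exclusive":
        return context_answer if context_usable else ABSTAIN
    if need == "Context-First":
        if context_usable:
            return context_answer
        return memory_answer if memory_answer is not None else ABSTAIN
    if need == "Memory-First":
        if memory_answer is not None:
            return memory_answer
        return context_answer if context_usable else ABSTAIN
    raise ValueError(f"unknown user need: {need!r}")


if __name__ == "__main__":
    # Under a knowledge conflict, the two instructions select different targets.
    print(expected_answer("Context-First", "Knowledge Conflict", "from context", "from memory"))
    print(expected_answer("Memory-First", "Knowledge Conflict", "from context", "from memory"))
```

Under these assumed rules, the same question and retrieved passage yield different target answers for different instructions, which is the sense in which the caption speaks of instructed faithfulness rather than absolute factual correctness.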