RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues

Tzu-Lin Kuo; Feng-Ting Liao; Mu-Wei Hsieh; Fu-Chieh Chang; Po-Chun Hsu; Da-Shan Shiu

RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues

Tzu-Lin Kuo, Feng-Ting Liao, Mu-Wei Hsieh, Fu-Chieh Chang, Po-Chun Hsu, Da-Shan Shiu

TL;DR

RAD-Bench provides a comprehensive benchmark for evaluating large language models in retrieval-augmented dialogues across multiple turns. By defining Retrieval Synthesis and Retrieval Reasoning and leveraging a data-generation pipeline plus LLM-based evaluators, it reveals how models struggle to maintain performance as user intents and constraints evolve, even with relevant retrieved contexts. The study demonstrates strong discriminative capability relative to industry benchmarks and outlines avenues to broaden scenario diversity, improve evaluation fidelity, and examine potential judge biases. Together, these contributions offer a practical tool for selecting and optimizing LLMs for context-rich, retrieval-enhanced applications.

Abstract

In real-world applications with Large Language Models (LLMs), external retrieval mechanisms - such as Search-Augmented Generation (SAG), tool utilization, and Retrieval-Augmented Generation (RAG) - are often employed to enhance the quality of augmented generations in dialogues. These approaches often come with multi-turn dialogue, where each interaction is enriched by relevant information retrieved from external sources. Existing benchmarks either assess LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. However, there is a gap in evaluating LLMs' ability to leverage retrieval for more precise responses across multiple turns. To address this limitation, we introduce RAD-Bench (Retrieval Augmented Dialogue), a benchmark designed to evaluate LLMs' capabilities in multi-turn dialogues following retrievals, essential for their deployment in context-rich applications. RAD-Bench evaluates two key abilities of LLMs: Retrieval Synthesis and Retrieval Reasoning. These are measured using discriminative questions and retrieved contexts, and corresponding reference answers, assessing how effectively LLMs integrate and reason with context to maintain and enhance conversation quality over multiple turns. Our evaluation results on commonly used LLMs reveal that model performance deteriorates as additional layers of conditions or constraints are applied across conversation turns, even when accurate retrieved contexts are provided. The data and code are available at https://github.com/mtkresearch/RAD-Bench

RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues

TL;DR

Abstract

Paper Structure (25 sections, 29 figures, 2 tables)

This paper contains 25 sections, 29 figures, 2 tables.

Introduction
Related Work
Retrieval Augmented Dialogue Benchmark
Evaluated Abilities
Evaluator
Benchmark Construction
Evaluation Results
Evaluation Setup
Main Results
Performance Across Dialogue Turns
Correlation with Chatbot Arena
Conclusions and Future Work
Details on the Data Generation
Data Collection
Question Candidate Generation
...and 10 more sections

Figures (29)

Figure 1: Evaluation Process in Retrieval Augmented Dialogue Benchmark: At each turn, a user question paired with a retrieved context is presented to the LLM for augmented generation. The LLM's response is scored on a scale of 1 to 10 using an LLM-as-a-Judge framework. This framework prompts the judge to assess how well the model utilized the given context to answer progressively changing questions, based on specific criteria, and compare it against a reference answer, ensuring accurate and consistent evaluations across different scenarios.
Figure 2: Correlation between RAD-Bench and Chatbot Arena (Hard-En prompts)chiang_chatbot_2024. Models exhibiting similar level of multi-turn chat capability do not perform similarly when they are applied to dialogues from retrieval, as showcased by results from Llama3.1-8B vs Mistral-Large; from Llama3.1-70B vs Deepseek-V2; from Llama3.1-405B vs GPT-4o. We surmise that the discrepancy could be reduced through including RAFT zhang_raft_2024 in post-trainings, aligning model behaviors closer to the scenarios in retrieval augmented dialogue.
Figure 3: Model performance across turns. (Top): Retrieval Synthesis; (Bottom) Retrieval Reasoning.
Figure 4: Data construction pipeline of RAD-Bench: The blue dashed lines represent scenarios with predetermined context integration at each turn, while the red dashed lines indicate scenarios where context must be retrieved via SAG or RAG, requiring additional search queries during question candidate generation (Phase 2).
Figure 5: Performance of evaluated LLMs
...and 24 more figures

RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues

TL;DR

Abstract

RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues

Authors

TL;DR

Abstract

Table of Contents

Figures (29)