MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

Yannis Katsis; Sara Rosenthal; Kshitij Fadnis; Chulaka Gunasekara; Young-Suk Lee; Lucian Popa; Vraj Shah; Huaiyu Zhu; Danish Contractor; Marina Danilevsky

TL;DR

mtRAG presents the first end-to-end, human-generated multi-turn Retrieval-Augmented Generation benchmark, covering 110 conversations across four domains and 842 tasks to evaluate both retrieval and generation in realistic, dynamic dialogue settings. It demonstrates that current RAG systems struggle, especially on later turns and unanswerable questions, and introduces mtRAG-S, a synthetic companion to enable scalable benchmarking. The study provides extensive analysis of retrieval strategies, model performance, and evaluation metrics (including reference-based and reference-less judges), highlighting gaps between automated scores and human judgments. The work advocates for stronger retrieval and generation capabilities, improved reference-less evaluation metrics, and future extensions to adversarial, multilingual, and domain-expansive scenarios.

Abstract

Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
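
The relation between conversations, turns, and tasks stated above (110 conversations, about 7.7 turns each, 842 tasks) suggests one task per turn: the question at that turn, answered in the context of the preceding conversation. The following is a minimal sketch of that reading, not the benchmark's actual data format; the `Task` class, its field names, and `conversation_to_tasks` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for one evaluation task (not the benchmark's schema).
@dataclass
class Task:
    conversation_id: str
    turn: int                        # 1-based position of the question
    history: List[Tuple[str, str]]   # preceding (question, response) pairs
    question: str                    # the question to be answered at this turn

def conversation_to_tasks(conversation_id: str,
                          turns: List[Tuple[str, str]]) -> List[Task]:
    """Expand a conversation into one task per turn.

    Assumption: each turn is a (question, response) pair, and the task at
    turn i sees all earlier pairs as context but not its own response.
    """
    tasks = []
    for i, (question, _response) in enumerate(turns):
        tasks.append(Task(conversation_id, i + 1, list(turns[:i]), question))
    return tasks

# Toy conversation with three turns (invented placeholder strings).
example_turns = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
tasks = conversation_to_tasks("conv-0", example_turns)

# Average turns per conversation implied by the reported totals.
avg_turns = round(842 / 110, 1)
```

Under this reading, later turns carry longer histories, which is consistent with the abstract's emphasis on later turns and non-standalone questions as sources of difficulty.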

Paper Structure (49 sections, 10 figures, 16 tables)

Introduction
Related Work
mtRAG Benchmark
Dimensions
Conversation Properties
Benchmark Creation
Annotators
Document Corpora
Human-Generated Conversations
Data Statistics
Retrieval
Experimental Setup
Retrieval strategies
Retrieval Results
Generation
...and 34 more sections

Figures (10)

Figure 1: 5/8 turns of a conversation from the CLAPnq domain. The conversation is enriched with question dimensions, passage diversity, and repair. The answerability is shown using the agent response color: answerable, unanswerable, and partial. The multi-turn type is shown using the question circle: follow-up and clarification. The different relevant passages highlight diversity and the original text shows a repair from the model response.
Figure 2: Distribution of tasks in mtRAG based on each of the benchmark's dimensions.
Figure 3: Generation results in the Reference ($\bullet$) retrieval setting using a single metric, $\textrm{RB}_{\textrm{alg}}$, on three different dimensions: (a) answerability, (b) turns, and (c) domains.
Figure 4: Weighted Spearman correlation of human evaluation with the automated metrics on the answerable subset for the GPT-4o and Llama 3.1 405B Inst. models.
Figure 5: Query rewrite prompt
...and 5 more figures
