MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky
TL;DR
MTRAG is presented as the first end-to-end, human-generated multi-turn Retrieval-Augmented Generation (RAG) benchmark. It covers 110 conversations across four domains, for a total of 842 tasks, and evaluates both retrieval and generation in realistic, dynamic dialogue settings. The study shows that current RAG systems struggle, especially on later turns and unanswerable questions, and introduces MTRAG-S, a synthetic companion intended to enable scalable benchmarking. It also analyzes retrieval strategies, model performance, and evaluation metrics (including reference-based and reference-less judges), highlighting gaps between automated scores and human judgments. The work argues for stronger retrieval and generation capabilities and better reference-less evaluation metrics, and points to future extensions covering adversarial, multilingual, and additional-domain scenarios.
Abstract
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating LLMs on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
