MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal; Yannis Katsis; Vraj Shah; Lihong He; Lucian Popa; Marina Danilevsky

TL;DR

MTRAG-UN is a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. Experiments on it show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses.

Abstract

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark

Paper Structure (15 sections, 6 figures, 5 tables)

This paper contains 15 sections, 6 figures, 5 tables.

Introduction
Benchmark Creation
Task Definitions
Document Corpora
Benchmark: Tasks and Statistics
Evaluation
Metrics
Retrieval
Generation
Conclusion and Future Work
Acknowledgments
Stats and Metrics
Details on UNderspecified
Stitching of the underspecified questions
Validation

Figures (6)

Figure 1: Portions of three conversations highlighting the challenges in MTRAG-UN. The answerability is shown using the assistant response color: answerable, unanswerable, and underspecified. The multi-turn type is shown using the question circle: follow-up and clarification. The last two examples show non-standalone questions.
Figure 2: Distribution of tasks in MTRAG-UN based on different dimensions.
Figure 3: Generation results in the Reference ($\bullet$) setting using $\textrm{RB}_\textrm{alg}$ on three different dimensions.
Figure 4: Distribution of tasks in MTRAG-UN based on conversational turn.
Figure 5: Weighted Spearman correlation: automated judge metrics vs human evaluation metrics.
