MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa, Marina Danilevsky

TL;DR

MTRAG-UN is a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. Experiments on it show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses.

Abstract

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark

Paper Structure (15 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Portions of three conversations highlighting the challenges in MTRAG-UN. The answerability is shown using the assistant response color: answerable, unanswerable, and underspecified. The multi-turn type is shown using the question circle: follow-up and clarification. The last two examples show non-standalone questions.
  • Figure 2: Distribution of tasks in MTRAG-UN based on different dimensions.
  • Figure 3: Generation results in the Reference ($\bullet$) setting using $\textrm{RB}_\textrm{alg}$ on three different dimensions.
  • Figure 4: Distribution of tasks in MTRAG-UN based on conversational turn.
  • Figure 5: Weighted Spearman correlation: automated judge metrics vs human evaluation metrics.
  • ...and 1 more figure