Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval

Yohan Lee, Yongwoo Song, Sangyeop Kim

TL;DR

The paper introduces the Conversational Data Retrieval (CDR) benchmark, the first comprehensive evaluation suite for retrieving conversation data, rather than documents, with a focus on multi-turn dynamics and implicit states. The benchmark comprises 1.6k queries across five analytical tasks and 9.1k conversations. An evaluation of 16 embedding models shows that even the top performers reach an NDCG@10 of only about 0.51, a substantial gap relative to document retrieval. The authors build a multi-stage data curation and synthesis pipeline (query templates, synthesized query-aligned conversations, reranking, human assessment, and classifier-based relevance verification) to produce diverse, validated benchmark data. They also provide a five-task taxonomy, practical query templates, and detailed per-task error analyses that expose core challenges in conversational data retrieval, such as turn dynamics, implicit state recognition, and contextual references, and that lay groundwork for future conversation-aware retrieval techniques.

Abstract

We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach an NDCG@10 of only about 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies challenges unique to conversational data retrieval (implicit state recognition, turn dynamics, contextual references) and provides practical query templates and detailed error analysis across task categories. The benchmark dataset and code are available at https://github.com/l-yohai/CDR-Benchmark.
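
For readers unfamiliar with the headline metric, NDCG@10 scores the top ten retrieved conversations by discounting each relevance label by the logarithm of its rank and then normalizing by the score of an ideal ordering. The following is a minimal sketch of that computation, assuming linear gain (some implementations use 2^rel - 1 instead) and numeric relevance labels. The function names and arguments are illustrative and are not taken from the CDR-Benchmark code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k labels, in ranked order."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """DCG@k of the system ranking divided by DCG@k of the ideal ranking.

    ranked_relevances: labels of the retrieved conversations, in system order.
    judged_relevances: labels of every judged conversation for the query
                       (assumption: used only to build the ideal ordering).
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical query with two relevant conversations, retrieved at ranks 1 and 3.
    print(round(ndcg_at_k([1, 0, 1], [1, 1, 0]), 4))
```

A benchmark-level score such as the reported 0.51 would then be an average of this per-query value over all queries, under whatever gain and labeling conventions the authors adopt.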

Paper Structure

This paper contains 37 sections, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Comparison between traditional document retrieval and conversational data retrieval.
  • Figure 2: An overview of the Conversational Data Retrieval (CDR) benchmark construction pipeline. (A) Collect and filter large-scale conversational data. (B) Generate query templates across five key areas. (C) Synthesize query-aligned conversations with LLMs. (D) Map relevance through reranking, human assessment, and classifier verification. (E) Integrate the processed data into a standardized CDR evaluation benchmark.
  • Figure 3: Domain distribution in the CDR benchmark dataset, showing diverse coverage across categories.
  • Figure 4: Task-specific NDCG@10 performance comparison of top-performing embedding models and category winners. All results are available in the appendix.
  • Figure 5: Prompt for generating synthetic conversations that match specific queries.
  • ...and 3 more figures