TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants

Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, Jeffery Dalton, Leif Azzopardi

TL;DR

The paper presents the TREC iKAT 2023 resources, which extend conversational search evaluation to personalized, knowledge-intensive tasks. It introduces PTKB-driven, multi-path dialogues over 20 topics (36 dialogues in total), a large ClueWeb22-B passage subset, and four assessment tracks (PTKB relevance, passage retrieval, and response generation judged by both humans and GPT-4). Key contributions include detailed evaluation protocols, multiple baselines and runs, and public resources that enable reproducible research on personalized conversational assistants and decisional search tasks. Findings show that multi-stage generate-retrieve-generate (G→R→G) pipelines, which start from the model's learned knowledge, generally improve passage ranking, while PTKB provenance adds complexity. GPT-4-based judgments align with human judgments on relevance and completeness but reveal grounding challenges and evaluator biases. Overall, iKAT provides a practical benchmark for evaluating context-aware, persona-driven conversational agents, with plans to expand topics and personas to improve scalability and realism.

Abstract

Conversational information seeking has evolved rapidly in the last few years with the development of Large Language Models (LLMs), providing the basis for interpreting and responding in a naturalistic manner to user requests. The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate their Conversational Search Agents (CSAs). The collection contains a set of 36 personalized dialogues over 20 different topics, each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas. A total of 344 turns with approximately 26,000 passages are provided as assessments on relevance, as well as additional assessments on generated responses over four key dimensions: relevance, completeness, groundedness, and naturalness. The collection challenges CSAs to efficiently navigate diverse personal contexts, elicit pertinent persona information, and employ context for relevant conversations. The integration of a PTKB and the emphasis on decisional search tasks contribute to the uniqueness of this test collection, making it an essential benchmark for advancing research in conversational and interactive knowledge assistants.

Paper Structure (25 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Example outcomes given a conversation on alternatives to cow's milk with three different personas.
  • Figure 2: Number of turns evaluated per dialogue in the final judgment pool vs. the maximum depth of each topic.
  • Figure 3: nDCG@5 aggregated for each dialogue across all runs on the passage ranking task. We report the average across runs, median or better.
  • Figure 4: nDCG@5 at varying conversation turn depths on the passage ranking task. We report the average across runs, median or better.
  • Figure 5: nDCG@5 at varying conversation turn depths on the passage ranking task, for turns that depend on PTKB statements vs. those that do not. We report the average across runs, median or better.
  • ...and 1 more figure