ProCIS: A Benchmark for Proactive Retrieval in Conversations

Chris Samarinas, Hamed Zamani

TL;DR

ProCIS addresses the lack of datasets and evaluation protocols for proactive conversational information seeking by introducing a large-scale Reddit-Wikipedia dataset with over 2.8 million multi-party conversations. It establishes a dedicated evaluation framework, including depth-$k$ pooling for relevance judgments and the normalized proactive discounted cumulative gain ($npDCG$) metric to measure timeliness and novelty in proactive retrieval. The authors present a novel Language Model Grounded Retrieval (LMGR) framework and benchmark a spectrum of baselines, demonstrating strong reactive performance for LMGR and highlighting the advantages of traditional neural models for proactive retrieval. By releasing data, code, and benchmarks, ProCIS aims to catalyze the development of proactive CIS systems and advance the integration of retrieval-augmented conversational agents in open-domain settings.

Abstract

The field of conversational information seeking, which is rapidly gaining interest in both academia and industry, is changing how we interact with search engines through natural language interactions. Existing datasets and methods mostly evaluate reactive conversational information seeking systems that only provide a response to each query from the user. We identify a gap in building and evaluating proactive conversational information seeking systems that can monitor a multi-party human conversation and proactively engage in it at an opportune moment by retrieving useful resources and suggestions. In this paper, we introduce a large-scale dataset for proactive document retrieval that consists of over 2.8 million conversations. We conduct crowdsourcing experiments to obtain high-quality and relatively complete relevance judgments through depth-k pooling. We also collect annotations identifying the parts of the conversation that relate to each document, enabling us to evaluate proactive retrieval systems. We introduce normalized proactive discounted cumulative gain (npDCG) for evaluating these systems, and further provide benchmark results for a wide range of models, including a novel model we developed for this task. We believe that the developed dataset, called ProCIS, paves the way for developing proactive conversational information seeking systems.
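
The abstract mentions depth-k pooling without describing it. In its standard form, depth-k pooling takes the top k documents from each system's ranked list and sends the union to annotators for relevance judgment. The sketch below illustrates only that general idea; the function name `depth_k_pool`, the system labels, and the document IDs are hypothetical and are not taken from the ProCIS code release, whose exact pooling configuration is not stated here.

```python
from typing import Dict, List, Set


def depth_k_pool(runs: Dict[str, List[str]], k: int) -> Set[str]:
    """Return the union of the top-k documents from each system's ranking."""
    pool: Set[str] = set()
    for ranking in runs.values():
        # Only the first k results of each run enter the judgment pool.
        pool.update(ranking[:k])
    return pool


# Hypothetical ranked lists from three systems for a single conversation.
runs = {
    "bm25": ["d1", "d2", "d3", "d4"],
    "dense": ["d2", "d5", "d1", "d6"],
    "lmgr": ["d7", "d2", "d8", "d1"],
}

pool = depth_k_pool(runs, k=2)
print(sorted(pool))  # ['d1', 'd2', 'd5', 'd7']
```

Documents outside the pool are usually treated as non-relevant, which is why the judgments are described as "relatively complete" rather than exhaustive.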

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: The online annotation interface for the crowd-sourcing of the test set.
  • Figure 2: Utterance length (# tokens) distribution in ProCIS.
  • Figure 3: Conversation length (# tokens) distribution in ProCIS.
  • Figure 4: Conversation length (# turns) distribution in ProCIS.
  • Figure 5: t-SNE visualization of the categories (subreddits) in the ProCIS test set.
  • ...and 2 more figures