QAConv: Question Answering on Informative Conversations

Chien-Sheng Wu, Andrea Madotto, Wenhao Liu, Pascale Fung, Caiming Xiong

TL;DR

QAConv is a question answering dataset grounded in informative conversations (business emails, panel discussions, and work channels), addressing the gap between document QA and dialogue QA. A four-stage data collection pipeline, assisted by a question generator and a dialogue summarizer, yields 34,608 QA pairs from 10,259 conversations, evaluated in two modes: chunk mode and full mode. Experiments show that state-of-the-art QA systems have limited zero-shot performance on dialogue-grounded questions and often predict them as unanswerable, underscoring the need for dialogue-aware reasoning and retrieval. The dataset offers a challenging benchmark for conversational QA and supports the study of retrieval-augmented QA and multi-hop reasoning over real-world dialogue sources.

Abstract

This paper introduces QAConv, a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations, including business emails, panel discussions, and work channels. Unlike open-domain and task-oriented dialogues, these conversations are usually long, complex, and asynchronous, and they involve strong domain knowledge. In total, we collect 34,608 QA pairs from 10,259 selected conversations with both human-written and machine-generated questions. We use a question generator and a dialogue summarizer as auxiliary tools to collect and recommend questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded partial conversation is provided or must be retrieved. Experimental results show that state-of-the-art pretrained QA systems have limited zero-shot performance and tend to predict our questions as unanswerable. Our dataset provides a new training and evaluation testbed to facilitate research on QA over conversations.

Paper Structure

This paper contains 36 sections, 6 figures, and 18 tables.

Figures (6)

  • Figure 1: An example of question answering on conversations and the data collection flow.
  • Figure 2: Question type tree map and examples (best viewed in color).
  • Figure 3: Diversity in answers in all categories.
  • Figure 4: Screenshot for human-written QA collection.
  • Figure 5: Screenshot for machine-generated QA collection.
  • ...and 1 more figure