A Survey on Recent Advances in Conversational Data Generation

Heydar Soudani, Roxana Petcu, Evangelos Kanoulas, Faegheh Hasibi

TL;DR

The paper surveys synthetic multi-turn conversational data generation across three paradigms: task-oriented dialogue (TOD), open-domain dialogue (ODD), and conversational information seeking (CIS). It proposes a unifying three-stage framework of seed generation, utterance generation, and quality filtering. It analyzes training versus simulation for TOD, diverse seed sources for ODD and CIS (including knowledge graphs, documents, and user profiles), and a spectrum of generation techniques ranging from templates to large-language-model prompts and latent-variable methods. It reviews evaluation approaches, covering automatic (reference-based and reference-free) and human metrics, and discusses factuality, diversity, and coherence under both extrinsic and intrinsic assessment. The discussion highlights challenges in factual grounding, personalization, safety, and bias, and outlines future directions for controllable, scalable, and evaluable synthetic data to improve downstream dialogue systems.

Abstract

Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.
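
The three components named above (seed data creation, utterance generation, and quality filtering) can be illustrated with a minimal sketch. This is not the survey's implementation or any specific surveyed method. It assumes topic strings as seeds, hand-written templates in place of an LLM prompt, and a distinct-1 token ratio as a reference-free filter. All names (`Dialogue`, `generate_seeds`, `generate_utterances`, `quality_filter`, `build_dataset`) and the threshold values are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Dialogue:
    seed: str
    turns: list  # list of (speaker, utterance) pairs


# Hypothetical templates; a real system might prompt an LLM instead.
USER_TEMPLATES = [
    "Can you tell me about {seed}?",
    "What else should I know about {seed}?",
]
SYSTEM_TEMPLATES = [
    "Here is some information on {seed}.",
    "One more detail about {seed}.",
]


def generate_seeds(topics):
    """Stage 1 (assumed form): a seed is a normalized, non-empty topic string."""
    return [t.strip().lower() for t in topics if t.strip()]


def generate_utterances(seed, num_exchanges, rng):
    """Stage 2: fill templates, alternating user and system turns."""
    turns = []
    for _ in range(num_exchanges):
        turns.append(("user", rng.choice(USER_TEMPLATES).format(seed=seed)))
        turns.append(("system", rng.choice(SYSTEM_TEMPLATES).format(seed=seed)))
    return Dialogue(seed=seed, turns=turns)


def distinct_1(dialogue):
    """Reference-free diversity score: unique tokens divided by total tokens."""
    tokens = [tok for _, utt in dialogue.turns for tok in utt.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def quality_filter(dialogues, min_distinct):
    """Stage 3: keep dialogues whose diversity score meets the threshold."""
    return [d for d in dialogues if distinct_1(d) >= min_distinct]


def build_dataset(topics, num_exchanges=2, min_distinct=0.3, seed=0):
    """Run the three stages in order with a fixed random seed."""
    rng = random.Random(seed)
    seeds = generate_seeds(topics)
    candidates = [generate_utterances(s, num_exchanges, rng) for s in seeds]
    return quality_filter(candidates, min_distinct)
```

Calling `build_dataset(["hotels", "trains"])` would return a list of `Dialogue` objects that passed the filter. In practice, each stage could be swapped independently, for example by replacing the template step with a model call or the distinct-1 check with a learned quality classifier.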

Paper Structure (42 sections, 14 figures)

This paper contains 42 sections and 14 figures.

Figures (14)

  • Figure 1: An overview of multi-turn conversation generation sections and papers.
  • Figure 2: An overview of methods for evaluating generated multi-turn conversations.
  • Figure 3: Two-sided simulation as presented in Simulated-Chat (Mohapatra et al., 2021).
  • Figure 7: On the left, a TOD example; on the right, the associated key terms, slots, and their values.
  • Figure 8: Architecture of a Task-Oriented Dialogue (TOD) system, composed of four modules communicating in a pipeline fashion. The input and output of each module are illustrated, indicating the process of receiving a user conversational turn, identifying intents, slots, and values, extracting response attributes from the data store, and rendering them as a natural language response.
  • ...and 9 more figures