ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach
Reem Gody, Mahmoud Goudy, Ahmed Y. Tawfik
TL;DR
ConvoGen introduces a multi-agent framework built on AutoGen to generate synthetic open-domain conversational data with persona-driven agents. It couples an experience generator (powered by GPT-4o and few-shot learning) with iterative sampling and a group-chat instantiation process. The generated conversations show high lexical diversity, as measured by MTLD (Measure of Textual Lexical Diversity), and strong grounding in the input experiences, as assessed by an LLM-based judge. The approach is evaluated against several human baselines across multiple configurations, showing that iterative sampling increases diversity and that the generated data can be well grounded in topic, situation, and personas. While promising for augmenting multi-party conversational datasets, the work notes potential risks from content bias or harmful outputs and emphasizes the need for safety filters and careful prompt tuning to ensure reliability in practice.
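To make the MTLD metric mentioned above concrete, here is a minimal sketch of the standard bidirectional MTLD computation. It assumes the conventional type-token ratio threshold of 0.72 and plain whitespace tokenization; the function names are illustrative, and the paper's exact tokenization and settings are not stated here, so its scores may differ.

```python
def _mtld_one_pass(tokens, threshold=0.72):
    """Count MTLD factors in one direction and return len(tokens) / factors."""
    factors = 0.0
    types = set()   # distinct tokens seen in the current segment
    count = 0       # tokens seen in the current segment
    for tok in tokens:
        count += 1
        types.add(tok)
        # A factor is complete once the segment's type-token ratio
        # drops to the threshold; the segment then restarts.
        if len(types) / count <= threshold:
            factors += 1
            types = set()
            count = 0
    if count > 0:
        # Credit the unfinished segment with a partial factor.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # Assumption: if no factor (not even a partial one) accumulates,
    # report infinity rather than dividing by zero.
    return len(tokens) / factors if factors > 0 else float("inf")


def mtld(tokens, threshold=0.72):
    """Average the forward and backward passes, as in the usual MTLD definition."""
    if not tokens:
        return 0.0
    forward = _mtld_one_pass(tokens, threshold)
    backward = _mtld_one_pass(tokens[::-1], threshold)
    return (forward + backward) / 2


repetitive = ["a"] * 10
varied = "the cat sat on the mat while a dog ran past".split()
print(mtld(repetitive), mtld(varied))
```

Higher values indicate that the text sustains a high type-token ratio over longer stretches, which is why the metric is less sensitive to text length than a raw type-token ratio.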
Abstract
In this paper, we present ConvoGen, an innovative framework for generating synthetic conversational data using multi-agent systems. Our method leverages few-shot learning and introduces iterative sampling from a dynamically updated few-shot hub to create diverse and realistic conversational scenarios. The generated data has numerous applications, including training and evaluating conversational AI models and augmenting existing datasets for tasks such as conversational intent classification or conversation summarization. Our experiments demonstrate the effectiveness of this method in producing high-quality, diverse synthetic conversational data, highlighting its potential to enhance the development and evaluation of conversational AI systems.
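The idea of iterative sampling from a dynamically updated few-shot hub can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the hub is updated, so the sketch assumes each newly generated experience is appended to the hub and becomes eligible as a few-shot example in later rounds. The names `iterative_sampling` and `stub_generate` are hypothetical, and `stub_generate` stands in for a real LLM call.

```python
import random


def stub_generate(shots):
    """Hypothetical placeholder for an LLM call conditioned on few-shot examples."""
    return "experience derived from: " + " | ".join(shots)


def iterative_sampling(seed_examples, generate, rounds, k=3, rng=None):
    """Generate `rounds` items, feeding each output back into the few-shot hub."""
    rng = rng or random.Random(0)   # fixed seed so the sketch is reproducible
    hub = list(seed_examples)       # the few-shot hub starts from seed examples
    generated = []
    for _ in range(rounds):
        # Draw up to k few-shot examples from the current hub.
        shots = rng.sample(hub, min(k, len(hub)))
        new_item = generate(shots)
        generated.append(new_item)
        # Assumed update rule: the new item joins the hub for later rounds.
        hub.append(new_item)
    return generated, hub


generated, hub = iterative_sampling(
    ["seed A", "seed B", "seed C"], stub_generate, rounds=4, k=2
)
print(len(generated), len(hub))
```

Because later rounds can draw on earlier outputs as well as the seeds, the prompt context changes from round to round, which is one plausible mechanism for the diversity gains the paper reports.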
