Faithful Persona-based Conversational Dataset Generation with Large Language Models
Pegah Jandaghi, XiangHai Sheng, Xinyi Bai, Jay Pujara, Hakim Sidahmed
TL;DR
This work addresses the scarcity and limited quality of persona-based dialogue data by introducing a Generator-Critic framework that uses large language models to automatically expand seed personas into a large, faithful conversational corpus. The three-stage pipeline (User Generation, User Pairing, and Conversation Generation) produces Synthetic-Persona-Chat, comprising 5k personas and 20k dialogues, with a mixture-of-experts Critic enforcing quality, faithfulness, and safety. In extensive automatic and human evaluations, the approach yields richer persona spaces, improved next-utterance prediction, and high human-likeness scores, outperforming Persona-Chat on several fronts. The framework is designed to be domain-agnostic and scalable: it reduces human labeling and enables specialized, domain-specific persona datasets, at the price of notable computational cost and dependence on the underlying LLMs.
Abstract
High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas: aspects of the user's character that provide insight into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user and maintain their engagement. In this paper, we leverage Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture that expands the initial dataset while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework along several dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat in a Turing test decreases from 17.2% to 8.8% over three iterations.
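The Generator-Critic loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the LLM Generator and the expert LLMs are replaced by hypothetical toy functions (`toy_generate`, `length_expert`, `safety_expert`), the experts are combined by a plain mean score, and "improving the Generator" is modeled, as an assumption, by adding the selected conversations to the pool of examples the Generator is conditioned on in the next iteration.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a conversation is a plain string, and an expert maps
# a conversation to a score in [0, 1]. Real experts would be LLM calls.
Expert = Callable[[str], float]
Generator = Callable[[Sequence[str], int], List[str]]


def critic_select(candidates: Sequence[str],
                  experts: Sequence[Expert],
                  top_k: int) -> List[str]:
    """Rank candidates by their mean expert score and keep the top_k."""
    if not experts:
        raise ValueError("the Critic needs at least one expert")

    def mean_score(conv: str) -> float:
        return sum(expert(conv) for expert in experts) / len(experts)

    return sorted(candidates, key=mean_score, reverse=True)[:top_k]


def generator_critic_loop(generate: Generator,
                          experts: Sequence[Expert],
                          seed_examples: Sequence[str],
                          iterations: int = 3,
                          n_candidates: int = 8,
                          top_k: int = 2) -> List[str]:
    """Grow the example pool: generate, let the Critic select, feed back."""
    examples = list(seed_examples)
    for _ in range(iterations):
        candidates = generate(examples, n_candidates)
        best = critic_select(candidates, experts, top_k)
        # Assumption: selected conversations become extra examples for the
        # Generator in the next iteration.
        examples.extend(best)
    return examples


def toy_generate(examples: Sequence[str], n: int) -> List[str]:
    """Deterministic stand-in for an LLM call (illustration only)."""
    base = len(examples)
    return [f"dialogue {base + i}: " + "turn " * (i + 1) for i in range(n)]


def length_expert(conv: str) -> float:
    """Toy quality expert: favors longer conversations, capped at 1.0."""
    return min(len(conv.split()) / 10.0, 1.0)


def safety_expert(conv: str) -> float:
    """Toy safety expert: rejects any conversation containing 'unsafe'."""
    return 0.0 if "unsafe" in conv else 1.0


if __name__ == "__main__":
    pool = generator_critic_loop(
        toy_generate, [length_expert, safety_expert],
        seed_examples=["seed dialogue"],
        iterations=3, n_candidates=4, top_k=2,
    )
    for conv in pool:
        print(conv)
```

Averaging the expert scores is only one way to combine a mixture of experts; a real Critic could instead apply hard filters (for example, discarding any candidate that a safety expert rejects) before ranking the remainder.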
