Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models

Adnan Ahmad; Stefan Hillmann; Sebastian Möller

TL;DR

This work addresses the bottleneck of obtaining diverse, realistic data for evaluating task-oriented dialogue systems by introducing an LLM-driven user simulation pipeline. It jointly employs GPT-4o and GPT-o1 to generate heterogeneous user profiles via a structured JSON template and uses a TU Berlin study-program database to drive multi-turn dialogues with a modular StudyBot, aided by Mistral-7B-Instruct-v0.2 for response generation. The study finds model-dependent differences in diversity and bias, reporting an overall 82.46% task-success rate across 57 simulations, demonstrating that automated synthetic data can robustly stress-test and inform improvements to dialogue systems. This approach promises scalable, low-cost data generation and evaluation for domain-specific, information-seeking dialogues, with potential impact on rapid system iteration and fairness across user types.
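As a quick sanity check on the reported figure, the sketch below finds which whole number of successful dialogues is consistent with 82.46% over 57 simulations. The function names are hypothetical, and the resulting count is inferred from the arithmetic only; the summary above does not state it.

```python
def success_rate(successes: int, total: int) -> float:
    """Task-success rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def consistent_counts(reported_pct: float, total: int, decimals: int = 2) -> list[int]:
    """Return every success count whose rate rounds to the reported percentage.

    A count matches when its rate lies within half a unit of the last
    reported decimal place.
    """
    tolerance = 0.5 * 10 ** (-decimals)
    return [
        k for k in range(total + 1)
        if abs(success_rate(k, total) - reported_pct) <= tolerance
    ]


# Reported in the summary: 82.46% over 57 simulations.
matches = consistent_counts(82.46, 57)
print(matches)
```

Only one integer count fits, so the reported percentage and the number of simulations are mutually consistent.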

Abstract

In this study, we explore the use of Large Language Models (LLMs) to generate synthetic users and simulate their conversations with a task-oriented dialogue system, and we present detailed results and analysis. We propose a novel, comprehensive user simulation approach that uses LLMs to create diverse user profiles, set goals, engage in multi-turn dialogues, and evaluate conversation success. We employ two proprietary LLMs, GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a heterogeneous base of user profiles characterized by varied demographics, multiple user goals, different conversational styles, initial knowledge levels, interests, and conversational objectives. We analyze the LLM-generated profiles in detail to assess their diversity, consistency, and potential biases. We find that GPT-o1 generates a more heterogeneous user distribution across most user attributes, whereas GPT-4o generates more skewed user attributes. The generated user profiles are then used to simulate dialogue sessions with a task-oriented dialogue system.
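One way to make "more heterogeneous" versus "more skewed" concrete is to score each profile attribute with normalized Shannon entropy. The sketch below is illustrative only: the `UserProfile` fields loosely follow the attributes named in the abstract, while the field names, category values, and the choice of entropy as the measure are assumptions, not the authors' template or metric.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical profile record; field names are assumptions."""
    age_group: str
    conversational_style: str
    knowledge_level: str
    goals: list[str] = field(default_factory=list)
    interests: list[str] = field(default_factory=list)


def normalized_entropy(values: list[str], num_categories: int) -> float:
    """Shannon entropy of the value distribution, scaled to [0, 1].

    1.0 means every one of the num_categories options is equally frequent;
    values near 0 mean a few options dominate.
    """
    if num_categories < 2 or not values:
        return 0.0
    total = len(values)
    entropy = -sum(
        (count / total) * math.log(count / total)
        for count in Counter(values).values()
    )
    return entropy / math.log(num_categories)


# Made-up style labels for two small sets of profiles (not data from the paper).
balanced_styles = ["formal", "casual", "terse", "verbose"]
skewed_styles = ["formal", "formal", "formal", "casual"]

balanced_score = normalized_entropy(balanced_styles, num_categories=4)
skewed_score = normalized_entropy(skewed_styles, num_categories=4)
print(f"balanced: {balanced_score:.3f}, skewed: {skewed_score:.3f}")
```

Dividing by the logarithm of a fixed category count, rather than the number of categories actually observed, keeps scores comparable between two generators even when one of them never produces some options.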
