Table of Contents
Fetching ...

SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Jonggeun Lee, Junseong Pyo, Jeongmin Park, Yohan Jo

Abstract

Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce \textbf{SpokenTOD}, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors -- cross-turn slots, barge-in, disfluency, and emotional prosody -- across diverse speakers and domains. Building on SpokenTOD, we present \textbf{SpokenUS}, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.

SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Abstract

Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce \textbf{SpokenTOD}, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors -- cross-turn slots, barge-in, disfluency, and emotional prosody -- across diverse speakers and domains. Building on SpokenTOD, we present \textbf{SpokenUS}, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.
Paper Structure (102 sections, 2 equations, 21 figures, 19 tables)

This paper contains 102 sections, 2 equations, 21 figures, 19 tables.

Figures (21)

  • Figure 1: SpokenTOD Construction Pipeline.
  • Figure 2: Overview of SpokenUS. The model processes streaming assistant speech in Listening Mode, determines barge-in timing through a turn-taking head, then generates responses through Pre-scripting and Speaking Modes.
  • Figure 3: Cumulative goal slot coverage over user turns.
  • Figure 4: Prompt template for barge-in error recovery (INCOHERENT RAW).
  • Figure 5: Prompt template for barge-in error recovery (incoherent interp).
  • ...and 16 more figures