A Sequence-to-Sequence Model for User Simulation in Spoken Dialogue Systems
Layla El Asri, Jing He, Kaheer Suleman
TL;DR
This paper presents a data-driven sequence-to-sequence user simulator for spoken dialogue systems that conditions on the full dialogue history to generate sequences of user intents. By using an encoder–decoder RNN with LSTMs, it captures history without rigid hand-crafted structures and can operate on both coarse compound acts and finer-grained original action spaces. Empirical results on the DSTC2 dataset show superior or competitive F-scores compared to agenda-based and n-gram baselines, with demonstrated generalization to the DSTC3 domain. The approach enables scalable, fine-grained user modeling and has potential to improve training of statistical dialogue policies in new domains.
Abstract
User simulation is essential for generating enough data to train a statistical spoken dialogue system. Previous models for user simulation suffer from several drawbacks, such as the inability to take dialogue history into account, the need for a rigid structure to ensure coherent user behaviour, heavy dependence on a specific domain, the inability to output several user intentions during one dialogue turn, or the requirement of a summarized action space for tractability. This paper introduces a data-driven user simulator based on an encoder-decoder recurrent neural network. The model takes as input a sequence of dialogue contexts and outputs a sequence of dialogue acts corresponding to user intentions. The dialogue contexts include information about the machine acts and the status of the user goal. We show on the Dialogue State Tracking Challenge 2 (DSTC2) dataset that the sequence-to-sequence model outperforms an agenda-based simulator and an n-gram simulator, according to F-score. Furthermore, we show how this model can be used on the original action space and thereby model user behaviour with finer granularity.
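The abstract compares simulators by F-score but does not spell out how it is computed. The sketch below shows one common way to score predicted user dialogue acts against reference acts, micro-averaged over turns. It is an illustrative assumption, not necessarily the paper's exact protocol: the function name `act_f_score`, the per-turn list representation, and the act labels in the example are all hypothetical.

```python
from collections import Counter


def act_f_score(predicted_turns, reference_turns):
    """Micro-averaged precision, recall and F-score over dialogue acts.

    Each argument is a list of turns; each turn is a list of act labels.
    This representation is an assumption made for illustration only.
    """
    if len(predicted_turns) != len(reference_turns):
        raise ValueError("predicted and reference must have the same number of turns")

    true_pos = false_pos = false_neg = 0
    for predicted, reference in zip(predicted_turns, reference_turns):
        pred_counts = Counter(predicted)
        ref_counts = Counter(reference)
        # Multiset intersection: acts that appear in both prediction and reference.
        overlap = sum((pred_counts & ref_counts).values())
        true_pos += overlap
        false_pos += sum(pred_counts.values()) - overlap
        false_neg += sum(ref_counts.values()) - overlap

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical act labels; the simulator may emit several acts in one turn.
    reference = [["inform", "request"], ["affirm"], ["bye"]]
    predicted = [["inform"], ["affirm", "request"], ["bye"]]
    print(act_f_score(predicted, reference))
```

Scoring each turn as a multiset of acts lets a simulator that outputs several intentions in one turn receive partial credit, which matters when comparing against baselines that emit a single act per turn.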
