Adapting Text-based Dialogue State Tracker for Spoken Dialogues

Jaeseok Yoon, Seunghyun Hwang, Ran Han, Jeonguk Bang, Kee-Eung Kim

TL;DR

This work addresses the challenge of transferring a text-based task-oriented DST to spoken dialogues by introducing a three-module pipeline: explicit ASR error correction to align spoken input with ground-truth text, a Description-Driven DST component (D3ST) that leverages slot descriptions and randomized slot ordering, and a post-processing step to recover errors in proper nouns. Using the DSTC11 speech-aware track data, including MultiWOZ-derived audio variants, the approach demonstrates that ASR correction, input description, and targeted post-processing markedly improve joint-goal tracking and reduce slot errors, achieving competitive results (third place) in the track. The study also analyzes worst-case slot errors, particularly for proper nouns, and highlights the importance of ontology-aware handling and separating hotel names from types to further improve performance. Overall, the paper provides a practical and effective blueprint for adapting text-based DST to spoken dialogue systems, with implications for real-world voice assistants and robust spoken interfaces.

Abstract

Although there have been remarkable advances in dialogue systems through the dialogue systems technology competition (DSTC), building a robust task-oriented dialogue system with a speech interface remains one of the key challenges. Most of the progress has been made for text-based dialogue systems, since datasets with written corpora are abundant while those with spoken dialogues are very scarce. However, as voice assistant systems such as Siri and Alexa show, it is of practical importance to transfer this success to spoken dialogues. In this paper, we describe our engineering effort in building a highly successful model that participated in the speech-aware dialogue systems technology challenge track in DSTC11. Our model consists of three major modules: (1) automatic speech recognition (ASR) error correction to bridge the gap between spoken and text utterances, (2) a text-based dialogue system (D3ST) for estimating slots and values using slot descriptions, and (3) post-processing for recovering from errors in the estimated slot values. Our experiments show that an explicit ASR error correction module, post-processing, and data augmentation are all important for adapting a text-based dialogue state tracker to spoken dialogue corpora.
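The third module, post-processing of estimated slot values, can be illustrated with a minimal sketch. The text above does not specify how the recovery is done, so the code below assumes one plausible approach: snapping a predicted value to the closest entry in an ontology by string similarity. The `ONTOLOGY` contents, the `recover_slot_value` function, and the `cutoff` threshold are all hypothetical and chosen only for illustration.

```python
import difflib

# Hypothetical ontology: slot name -> known candidate values.
# These entries are illustrative and are not taken from the paper.
ONTOLOGY = {
    "restaurant-name": ["pizza hut city centre", "curry garden", "the missing sock"],
    "hotel-name": ["acorn guest house", "alexander bed and breakfast"],
}


def recover_slot_value(slot, value, ontology=ONTOLOGY, cutoff=0.8):
    """Return the closest ontology value for `slot`, or `value` unchanged.

    Assumption: string similarity is a reasonable proxy for recovering
    proper nouns that were distorted by speech recognition.
    """
    candidates = ontology.get(slot)
    if not candidates:
        # Slots without a closed value list (e.g. times) are left untouched.
        return value
    matches = difflib.get_close_matches(value.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else value


if __name__ == "__main__":
    print(recover_slot_value("restaurant-name", "curry gardan"))
    print(recover_slot_value("taxi-leaveat", "17:15"))
```

A high cutoff keeps the step conservative: a value that resembles no ontology entry is passed through unchanged rather than forced onto an unrelated name.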

Paper Structure

This paper contains 16 sections, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Our model structure
  • Figure 2: ASR correction example (MultiWOZ mul0207.json)
  • Figure 3: An example of DST model input. We applied a random ordering mechanism to D3ST-based input.
  • Figure 4: Slot error rate for each slot. Most slots with high error rates take proper nouns as values, for example, taxi-destination (54.5%), taxi-arriveby (40.5%), and restaurant-name (39.5%). Slots shown in red contain proper nouns.
  • Figure 5: Causes of errors in the hotel-type slot. Most of these errors are believed to stem from underestimation.
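The random ordering of slot descriptions mentioned for Figure 3 can be sketched as follows. This is only an illustration under stated assumptions: the `index:description` prefix format, the `[user]`/`[system]` speaker tags, the slot descriptions, and the `build_dst_input` helper are hypothetical and may differ from the input format actually used by the authors.

```python
import random

# Hypothetical slot descriptions; the wording is illustrative only.
SLOT_DESCRIPTIONS = {
    "hotel-name": "name of the hotel",
    "hotel-type": "type of the hotel",
    "restaurant-name": "name of the restaurant",
    "taxi-destination": "destination of the taxi",
}


def build_dst_input(dialogue_turns, slot_descriptions, rng):
    """Build a description-prefixed model input with shuffled slot order.

    Returns the input string and the index -> slot mapping needed to
    decode index-based predictions back into slot names.
    """
    slots = list(slot_descriptions)
    rng.shuffle(slots)  # a new order per example, so indices carry no fixed meaning
    index_to_slot = dict(enumerate(slots))
    description_part = " ".join(
        f"{i}:{slot_descriptions[slot]}" for i, slot in index_to_slot.items()
    )
    history_part = " ".join(f"[{speaker}] {text}" for speaker, text in dialogue_turns)
    return f"{description_part} {history_part}", index_to_slot


if __name__ == "__main__":
    turns = [("user", "i need a taxi to the acorn guest house")]
    text, mapping = build_dst_input(turns, SLOT_DESCRIPTIONS, random.Random(0))
    print(text)
    print(mapping)
```

Shuffling the order for every example is meant to push the model to read the descriptions rather than memorize which index corresponds to which slot.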