Are LLMs Robust for Spoken Dialogues?

Seyed Mahed Mousavi, Gabriel Roccabruna, Simone Alghisi, Massimo Rizzoli, Mirco Ravanelli, Giuseppe Riccardi

TL;DR

This paper assesses whether large language models remain robust in spoken task-oriented dialogues (TODs) by analyzing ASR noise patterns and injecting them into the training data. The authors fine-tune GPT-2 for response generation and T5 for dialogue state tracking (DST) on clean TOD data, then evaluate on the spoken Human-Verbatim (HV) and Human-Paraphrased (HP) variants of MultiWOZ 2.1, using both automatic metrics and human judgments. The key finding is that LLMs are not robust to spoken noise without spoken TOD training data: noise-aware fine-tuning notably improves end-to-end response quality according to human judges, while DST shows limited gains unless slot-value noise is targeted. These results highlight the importance of spoken TOD datasets and caution against relying solely on automatic metrics.

Abstract

Large pre-trained language models (LLMs) have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tracking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues (TODs) focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In this work, we have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets. Due to the lack of proper spoken dialogue datasets, we have automatically transcribed a development set of spoken dialogues with a state-of-the-art ASR engine. We have characterized the ASR-error types and their distributions and simulated these errors in a large dataset of dialogues. We report the intrinsic (perplexity) and extrinsic (human evaluation) performance of fine-tuned GPT-2 and T5 models in two subtasks of response generation and dialogue state tracking, respectively. The results show that LLMs are not robust to spoken noise by default; however, fine-tuning or training such models on a proper dataset of spoken TODs can result in more robust performance.
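The error-simulation step described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the error rates, the confusion table, the filler tokens, and the function name `inject_asr_noise` are all hypothetical. The paper estimates its error types and distributions from real ASR transcripts, and those values are not reproduced here.

```python
import random

# Hypothetical per-token error rates, chosen only for illustration.
ERROR_RATES = {"substitution": 0.10, "deletion": 0.03, "insertion": 0.02}

# Hypothetical confusion table. "trains" -> "trends" is the one
# misrecognition the paper's Figure 4 caption mentions; the rest is invented.
CONFUSIONS = {"trains": ["trends"], "two": ["to", "too"]}

# Hypothetical tokens that a recognizer might spuriously insert.
FILLERS = ["uh", "the", "a"]


def inject_asr_noise(utterance, rates=ERROR_RATES, rng=None):
    """Return a lowercased copy of `utterance` with simulated ASR errors."""
    rng = rng or random.Random()
    noisy = []
    for token in utterance.lower().split():
        draw = rng.random()
        if draw < rates["deletion"]:
            # Deletion: drop the token entirely.
            continue
        if draw < rates["deletion"] + rates["substitution"] and token in CONFUSIONS:
            # Substitution: swap in a listed confusable word.
            noisy.append(rng.choice(CONFUSIONS[token]))
        else:
            noisy.append(token)
        if rng.random() < rates["insertion"]:
            # Insertion: add a spurious token after the current one.
            noisy.append(rng.choice(FILLERS))
    return " ".join(noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    print(inject_asr_noise("I need two trains to Cambridge", rng=rng))
```

Applying such a function to the user turns of a written corpus would yield a noisy training set; passing a seeded `random.Random` keeps the corruption reproducible across runs.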

Paper Structure (12 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The influence of fine-tuning the model (GPT-2 Small) using different portions of the clean and noisy dialogues. The model is evaluated on three test sets of clean (MultiWOZ 2.1) dialogues, HV, and HP.
  • Figure 2: The lexical similarity between the generated responses (GPT-2 Medium) and the ground truth in different fine-tuning settings. The lowest lexical similarity is between the generated responses and the ground truth.
  • Figure 3: The error types selected by the human annotators to explain their negative judgments (Not Appropriate and Not Contextualized) of the response candidates.
  • Figure 4: A dialogue example from the Human-Paraphrased test set in which the ASR model misrecognizes the word "trains" as "trends". While the model fine-tuned on clean dialogues fails to handle this error, the model fine-tuned on noisy TODs is more robust to such errors and generates an appropriate response.