Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

Katia Vendrame, Bolaji Yusuf, Santosh Kesiraju, Šimon Sedláček, Oldřich Plchot, Jan Černocký

TL;DR

This work tackles data scarcity and cross-domain generalization in end-to-end spoken dialogue state tracking by introducing a joint training framework that adds a text encoder to process unpaired textual DST data. The multimodal DST model shares components between speech and text paths through a connector and LoRA adapters, while inference remains cost-efficient. Experiments on SpokenWOZ and MultiWOZ demonstrate substantial cross-domain gains when incorporating target-domain text, with larger language models yielding larger improvements. The approach enables multi-domain spoken DST without collecting spoken data for every domain and opens avenues for textual DST augmentation and paraphrasing techniques in future work.

Abstract

End-to-end spoken dialogue state tracking (DST) is made difficult by the twin challenges of handling speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work as a way to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. Our experiments show that the proposed method achieves good cross-domain DST performance without relying on spoken training data from the target domains.

Paper Structure

This paper contains 12 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: E2E DST model under various training regimes.
  • Figure 2: Joint goal accuracy (JGA) as a function of the text loss weight. In the legend, X, Y$\rightarrow$Z denotes a model trained on paired speech data from X and unpaired text from Y, and tested on the validation set from Z.
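
The Figure 2 caption refers to two quantities that a short sketch may make more concrete: JGA and a text loss weight. The Python below is a minimal illustration only, not the authors' implementation. It uses the standard definition of joint goal accuracy (a turn counts as correct only if the whole predicted dialogue state matches the reference), and it assumes that the speech and text losses are combined as a simple weighted sum. The function names, the `text_weight` parameter, and the example slot names are hypothetical.

```python
from typing import Dict, List

# A dialogue state maps slot names to values, e.g. {"hotel-area": "north"}.
State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], reference: List[State]) -> float:
    """Fraction of turns whose predicted state matches the reference exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must cover the same turns")
    if not reference:
        return 0.0
    # A single wrong, missing, or extra slot makes the whole turn incorrect.
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


def joint_loss(speech_loss: float, text_loss: float, text_weight: float) -> float:
    """ASSUMPTION: weighted sum of the two losses; the exact form used in the
    paper is not stated in this summary."""
    return speech_loss + text_weight * text_loss


if __name__ == "__main__":
    preds = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}]
    refs = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "5"}]
    print(joint_goal_accuracy(preds, refs))
    print(joint_loss(speech_loss=1.0, text_loss=2.0, text_weight=0.5))
```

Under this assumed form, setting `text_weight` to zero reduces training to the speech-only objective, and sweeping it would produce the kind of curve the caption describes.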