Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking
Katia Vendrame, Bolaji Yusuf, Santosh Kesiraju, Šimon Sedláček, Oldřich Plchot, Jan Černocký
TL;DR
This work tackles data scarcity and cross-domain generalization in end-to-end spoken dialogue state tracking by introducing a joint training framework that adds a text encoder to process unpaired textual DST data. The multimodal DST model shares components between speech and text paths through a connector and LoRA adapters, while inference remains cost-efficient. Experiments on SpokenWOZ and MultiWOZ demonstrate substantial cross-domain gains when incorporating target-domain text, with larger language models yielding larger improvements. The approach enables multi-domain spoken DST without collecting spoken data for every domain and opens avenues for textual DST augmentation and paraphrasing techniques in future work.
Abstract
End-to-end spoken dialogue state tracking (DST) is made difficult by the twin challenges of handling speech input and coping with data scarcity. Recent work has proposed combining speech foundation encoders with large language models to alleviate some of this difficulty. Although this approach yields strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. Collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, we propose jointly training on the available spoken DST data and written text data from other domains as a way to achieve cross-domain generalization. Our experiments show that the proposed method achieves good cross-domain DST performance without relying on spoken training data from the target domains.
