Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling

Maximillian Chen; Ruoxi Sun; Sercan Ö. Arık

TL;DR

The paper extends multimodal language models to spoken conversations through a data-centric, multi-task learning approach. It designs three auxiliary tasks (Listening Comprehension, Cross-Modal Commonsense Reasoning, and Response Generation) that extract more cross-modal signal from a fixed set of speech data, so models can be adapted efficiently without large-scale data collection. It also introduces ASK-QA, a new dataset that stresses mixed-initiative, multi-turn spoken dialogue with ambiguous user requests. The method achieves state-of-the-art results on Spoken-SQuAD using only 10% of the training data, with strong results on SD-QA as well. Overall, the work shows that carefully designed auxiliary tasks can substantially improve end-to-end spoken question answering for both open-weight and closed-weight models, offering a scalable path for audio-centric conversational modeling.

Abstract

Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs. Code and data forthcoming.

Paper Structure

Table of Contents

Figures (6)