Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling

Maximillian Chen, Ruoxi Sun, Sercan Ö. Arık

TL;DR

The paper extends multimodal language models to spoken conversations through a data-centric, multi-task learning approach. It designs three auxiliary tasks (Listening Comprehension, Cross-Modal Commonsense Reasoning, and Response Generation) to extract more cross-modal signal from a fixed set of speech data, so that models can be adapted without large-scale data collection. It also introduces ASK-QA, a dataset that stress-tests mixed-initiative, multi-turn spoken dialogue with ambiguous user requests. The method achieves state-of-the-art results on Spoken-SQuAD using only 10% of the training data, with strong results on SD-QA as well. Overall, the work shows that carefully designed auxiliary tasks can substantially improve end-to-end spoken question answering for both open-weight and closed-weight models, offering a scalable path for audio-centric conversational modeling.

Abstract

Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs. Code and data forthcoming.

Paper Structure

This paper contains 46 sections, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Automatic speech recognition is a necessary implicit skill for MLLMs in end-to-end spoken question answering. We propose a multi-task learning approach which explicitly teaches these skills, as exemplified by this QA pair from Spoken-SQuAD.
  • Figure 2: Simplified summary of the pipeline for constructing ASK-QA. For each text conversation in Abg-CoQA, we construct three speaker profiles with randomly sampled voices, speaking rates, and pitches. We use TTS to synthesize the story context as a spoken narration, then each individual dialogue turn. The resulting audio files are joined as a single recording.
  • Figure 3: Multi-task (MT) learning improves upon Single-task (ST) fine-tuning with both Gemini and Speech-Qwen on ASK-QA's multi-turn evaluation.
  • Figure 4: Our multi-task approach applied to Speech-Qwen outperforms the state-of-the-art approach on Spoken-SQuAD using only 10% of the available data.
  • Figure A5: Multi-turn evaluation pipeline for ASK-QA. A model is given an audio recording containing the spoken story and spoken conversation. It is tasked with providing the correct response. While the model response is a clarifying question (as determined by a prompted Action Classifier), the model-generated response is appended to a textual version of the conversation history and shown to a user simulator. The user simulator provides a coherent response to the clarifying question, and these two generated turns are synthesized using TTS to create a new spoken context. This process repeats until the model response is not a clarifying question.
  • ...and 1 more figure
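
The multi-turn evaluation loop described in the caption of Figure A5 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `run_multi_turn_eval`, the callable parameters (`model`, `is_clarifying_question`, `user_simulator`, `tts`), the `max_turns` safety cap, and the use of byte concatenation as a stand-in for joining audio recordings are all assumptions made for the example. The toy stubs at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple


def run_multi_turn_eval(
    audio_context: bytes,
    text_history: List[str],
    model: Callable[[bytes], str],
    is_clarifying_question: Callable[[str], bool],
    user_simulator: Callable[[List[str]], str],
    tts: Callable[[str], bytes],
    max_turns: int = 5,  # assumed safety cap; the caption does not mention one
) -> Tuple[str, List[str]]:
    """Loop until the model's response is no longer a clarifying question."""
    history = list(text_history)  # copy so the caller's list is not mutated
    audio = audio_context
    response = model(audio)
    turns = 0
    while is_clarifying_question(response) and turns < max_turns:
        # Append the model's clarifying question to the textual history.
        history.append(response)
        # The user simulator answers the question given the text history.
        user_reply = user_simulator(history)
        history.append(user_reply)
        # Synthesize both new turns and extend the spoken context.
        # Byte concatenation stands in for real audio joining here.
        audio = audio + tts(response) + tts(user_reply)
        response = model(audio)
        turns += 1
    return response, history


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_tts(text: str) -> bytes:
    return text.encode("utf-8") + b"|"


def toy_model(audio: bytes) -> str:
    # Asks for clarification until it has "heard" at least two segments.
    if audio.count(b"|") < 2:
        return "Which one do you mean?"
    return "The red ball."


def toy_classifier(response: str) -> bool:
    return response.endswith("?")


def toy_user(history: List[str]) -> str:
    return "I mean the red one."


final_response, final_history = run_multi_turn_eval(
    audio_context=toy_tts("story"),
    text_history=["Which ball did she pick up?"],
    model=toy_model,
    is_clarifying_question=toy_classifier,
    user_simulator=toy_user,
    tts=toy_tts,
)
print(final_response)
print(final_history)
```

Passing the model, classifier, simulator, and TTS system in as callables keeps the loop itself independent of any particular model or speech backend, which makes it straightforward to substitute real components for the toy ones.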