What Are They Doing? Joint Audio-Speech Co-Reasoning
Yingzhi Wang, Pooneh Mousavi, Artem Ploujnikov, Mirco Ravanelli
TL;DR
Most systems process audio and speech separately, even when both occur in the same clip. This work introduces the Joint Audio-Speech Co-Reasoning (JASCO) task and the "What Are They Doing" dataset, which strictly require reasoning across both modalities, along with a dual-encoder baseline and a Model-As-Judge evaluation framework. Experiments across four popular Auditory Large Language Models (ALLMs) reveal limited joint reasoning and clear modality biases: some models lean toward speech cues, while others show partial integration. The paper provides a new diagnostic benchmark and an open dataset to drive progress toward truly integrated audio-speech reasoning in large language models.
Abstract
In audio and speech processing, tasks usually focus on either the audio or the speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, prompting further consideration of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a task that unifies audio and speech processing and strictly requires co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.
