What Are They Doing? Joint Audio-Speech Co-Reasoning

Yingzhi Wang; Pooneh Mousavi; Artem Ploujnikov; Mirco Ravanelli

TL;DR

This work targets the gap where audio and speech are processed separately in most systems. It introduces the Joint Audio-Speech Co-Reasoning (JASCO) task and the What Are They Doing dataset to enforce strict, cross-modal reasoning, along with a dual-encoder baseline and a Model-As-Judge evaluation framework. Experiments across four popular ALLMs reveal limited joint reasoning and clear modality biases, with some models leaning toward speech cues while others show partial integration. The paper provides a new diagnostic benchmark and open dataset to drive progress toward truly integrated audio-speech reasoning in large language models.

Abstract

In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.

Paper Structure (12 sections, 2 figures, 2 tables)

Introduction
Joint Audio-Speech Co-Reasoning
Task Design
Baseline Model Architecture
Evaluation Metric
What Are They Doing Dataset
Dataset Design
Audio Preparation
Experiments
Experimental Setup
Results and Analysis
Conclusion

Figures (2)

Figure 1: The baseline model architecture for Joint Audio-Speech Co-Reasoning. Dual encoders extract acoustic and semantic information, respectively, from the input audio clip, and their outputs are merged via a concatenation module. The fused embedding is concatenated with the instruction prompt embedding before being passed into an LLM with LoRA adaptors for co-reasoning.
Figure 2: Modality-dependence results of four popular ALLMs. The four sub-figures show evaluations from three LLM judges and their average. Red, blue, and green indicate the proportion of audio-dependent (A), speech-dependent (S), and both-dependent (A-S) responses, respectively.
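
The modality-dependence proportions summarized in Figure 2 can be illustrated with a short tally. This is a minimal sketch, not the authors' evaluation code: it assumes each judge assigns exactly one of the labels A, S, or A-S to every model response, and that the "average" panel is the unweighted mean of the per-judge proportions. The function names and the judge names in the usage note are hypothetical.

```python
from collections import Counter

# Labels taken from the Figure 2 caption: audio-dependent,
# speech-dependent, and both-dependent responses.
LABELS = ("A", "S", "A-S")


def proportions(labels):
    """Return the share of each dependence label in one judge's verdicts."""
    if not labels:
        raise ValueError("a judge must label at least one response")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in LABELS}


def summarize(judgments):
    """Compute per-judge proportions and their unweighted mean.

    `judgments` maps a judge name to the list of labels that judge gave
    to one model's responses. Averaging proportions across judges is an
    assumption about how the fourth panel is obtained.
    """
    if not judgments:
        raise ValueError("at least one judge is required")
    per_judge = {name: proportions(labels) for name, labels in judgments.items()}
    average = {
        label: sum(p[label] for p in per_judge.values()) / len(per_judge)
        for label in LABELS
    }
    return per_judge, average
```

For example, calling `summarize({"judge_1": ["A", "S", "S", "A-S"], "judge_2": ["S", "S", "S", "A-S"]})` yields one proportion dictionary per judge plus their mean, which is the kind of quantity one stacked bar per model would display.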
