Spatial Audio Question Answering and Reasoning on Dynamic Source Movements

Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

TL;DR

This work addresses the challenge of Spatial AQA for dynamic, moving sound sources by introducing movement-centric data augmentation, an end-to-end spatial audio-language model with a thinking mode, and query-conditioned source separation via an Audio Grounding Model. It provides a movement-focused Spatial AQA dataset generated from AudioSet, an architecture that produces intermediate reasoning steps before answering, and an evaluation showing that combining high-quality source separation with reasoning yields the largest gains, especially for single-event queries. The study reveals the critical interplay between motion modeling, reasoning processes, and separation quality, offering practical insights for improving spatial audio understanding in realistic, dynamic scenes. Future directions include extending to more complex real-world environments and multi-source recordings with stronger temporal priors for spatial reasoning.

Abstract

Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spatial AQA) with a focus on movement reasoning, where a model must infer object motion, position, and directional changes directly from stereo audio. First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, enabling controlled and scalable training data generation. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps before predicting an answer. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes: no masking, masks predicted by an audio grounding model (AGM), and ground-truth masks. Our results show that reasoning amplifies the benefits of source separation, with thinking mode yielding a significant improvement of +5.1% when the question involves a single event. These findings highlight the interplay between movement modeling, reasoning, and separation quality, offering new insights for advancing spatial audio understanding.
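
To make the idea of synthesizing motion from an isolated mono event concrete, the following is a minimal sketch, not the rendering method used in this work, which the abstract does not specify. It assumes simple constant-power amplitude panning with a linear azimuth sweep, and ignores inter-channel time differences and room acoustics. The function name `pan_moving_source` and its parameters are hypothetical.

```python
import numpy as np


def pan_moving_source(mono, start_azimuth_deg, end_azimuth_deg):
    """Render a mono signal as stereo while the source sweeps linearly in azimuth.

    Assumption for illustration: -90 degrees is hard left, +90 is hard right,
    and position is encoded only through constant-power channel gains.
    """
    mono = np.asarray(mono, dtype=np.float64)
    # One azimuth value per sample, moving linearly from start to end.
    azimuth = np.linspace(start_azimuth_deg, end_azimuth_deg, num=mono.shape[0])
    # Map azimuth in [-90, 90] degrees to a pan angle in [0, pi/2].
    theta = (np.clip(azimuth, -90.0, 90.0) + 90.0) / 180.0 * (np.pi / 2.0)
    # cos^2 + sin^2 = 1, so the summed channel power stays constant over time.
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=0)  # shape: (2, num_samples)


# Example: a 1-second 440 Hz tone moving from far left to far right.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 440.0 * t)
stereo = pan_moving_source(tone, -90.0, 90.0)
```

Because the start and end azimuths are known at synthesis time, a movement label (for example, "left to right") can be derived directly from the parameters. This is the kind of controlled supervision that generating data from isolated mono events makes possible.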

Paper Structure (33 sections, 2 figures, 9 tables)