Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs
Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang
TL;DR
The paper asks whether Large Audio-Language Models (LALMs) can perceive auditory motion, a dimension often neglected in prior work. It introduces AMPBench, a physically grounded benchmark that uses binaural audio to test models on direction and trajectory inference through two QA formats (MCQ and TF) across multiple noise levels and motion variants. Across several state-of-the-art LALMs evaluated zero-shot, the study finds a consistent motion perception deficit: average accuracy remains below 50% and performance is largely insensitive to audio quality, revealing a gap between semantic understanding and spatial reasoning. The work offers concrete diagnostics and guidelines, such as incorporating ITD/ILD cues and differentiable binaural rendering, and argues for training data and objectives that explicitly encode spatial physics to advance embodied auditory intelligence.
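For readers unfamiliar with the ITD/ILD cues mentioned above, here is a minimal sketch of how the interaural time difference (ITD) and interaural level difference (ILD) can be estimated from a two-channel signal. It is an illustration only, not the paper's pipeline: the function names `estimate_itd` and `estimate_ild_db`, the synthetic 500 Hz tone, the 16 kHz sample rate, and the lag limit are all assumptions chosen for the example.

```python
import math


def estimate_itd(left, right, sample_rate, max_lag):
    """Estimate ITD in seconds as the lag that maximizes cross-correlation.

    A positive value means the right channel lags the left,
    i.e. the sound reached the left ear first.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def estimate_ild_db(left, right, eps=1e-12):
    """Estimate ILD in dB as the left/right RMS ratio (positive: left is louder)."""
    rms_left = math.sqrt(sum(x * x for x in left) / len(left))
    rms_right = math.sqrt(sum(x * x for x in right) / len(right))
    return 20.0 * math.log10((rms_left + eps) / (rms_right + eps))


# Synthetic example (assumed values): a 500 Hz tone where the right channel
# is delayed by 5 samples and attenuated to half amplitude, which mimics a
# source located toward the listener's left.
sample_rate = 16000
delay = 5
left = [math.sin(2 * math.pi * 500 * i / sample_rate) for i in range(512)]
right = [0.0] * delay + [0.5 * x for x in left[:-delay]]

# max_lag stays below half the tone's period (32 samples) to avoid
# the periodic ambiguity of cross-correlation on a pure tone.
itd = estimate_itd(left, right, sample_rate, max_lag=12)
ild = estimate_ild_db(left, right)
print(f"ITD: {itd * 1e6:.1f} microseconds, ILD: {ild:.2f} dB")
```

Tracking how these two values change across short consecutive frames would give a crude signal of left-to-right or right-to-left motion, which is the kind of physical cue the benchmark probes.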
Abstract
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AMPBench is a controlled question-answering benchmark that tests whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. Average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights the gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future LALMs.
