Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs

Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang

TL;DR

The paper asks whether large audio-language models (LALMs) can perceive auditory motion, a dimension often neglected in prior work. It introduces AMPBench, a physically grounded benchmark that uses binaural audio to test models on direction and trajectory inference through two QA formats, multiple-choice (MCQ) and true/false (TF), across multiple noise levels and motion variants. Across several state-of-the-art LALMs in zero-shot settings, the study finds a consistent motion perception deficit: average accuracy remains below 50% and performance is largely insensitive to audio quality, revealing a gap between semantic understanding and spatial reasoning. The work offers concrete diagnostics and guidelines, such as incorporating interaural time and level difference (ITD/ILD) cues and differentiable binaural rendering, and argues for training data and objectives that explicitly encode spatial physics to advance embodied auditory intelligence.

Abstract

Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AMPBench is a controlled question-answering benchmark that tests whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future LALMs.

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of the motion perception gap in LALMs. The upper-left panel depicts a tiger moving along four distinct trajectories around the model, and the right panels show the corresponding binaural waveforms. Each trajectory yields characteristic left–right patterns, including asymmetric amplitude envelopes, interaural intensity differences, and distance-dependent decay that jointly encode the underlying spatial motion. Despite the richness of these spatial cues, current LALMs fail to infer the source's direction or trajectory.
  • Figure 2: Examples from AMPBench, including (a) a multiple-choice question and (b) a true–false verification. Each item couples a synthesized binaural motion clip with a structured QA prompt for label-based evaluation.
  • Figure 3: Examples of True and False statements used in the T/F verification task.
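
The left–right level cue mentioned in the Figure 1 caption can be illustrated with a small toy example. The sketch below is not AMPBench's rendering pipeline: it assumes simple constant-power panning of a sine tone (no ITD, HRTF, room, or distance model), and the names `synth_moving_tone`, `frame_ild_db`, and `infer_direction`, the sample rate, and the 1 dB threshold are all hypothetical choices made for illustration. It only shows that a frame-wise interaural level difference (ILD) is enough to recover the direction of a simple left-to-right sweep.

```python
import math

SAMPLE_RATE = 16_000  # Hz; assumed value for this illustration


def synth_moving_tone(start_az_deg, end_az_deg, duration_s=1.0, freq_hz=500.0):
    """Return (left, right) sample lists for a tone swept between two azimuths.

    Azimuth -90 is hard left, +90 is hard right. Only a level cue is
    modelled, via constant-power panning (a simplifying assumption).
    """
    n = int(SAMPLE_RATE * duration_s)
    left, right = [], []
    for i in range(n):
        frac = i / (n - 1)
        az = start_az_deg + frac * (end_az_deg - start_az_deg)
        # Map azimuth [-90, 90] degrees to a pan angle in [0, pi/2].
        pan = (az + 90.0) / 180.0 * (math.pi / 2)
        s = math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        left.append(math.cos(pan) * s)
        right.append(math.sin(pan) * s)
    return left, right


def frame_ild_db(left, right, frame_len=1600, eps=1e-12):
    """Per-frame ILD in dB; positive values mean the right channel is louder."""
    ilds = []
    for start in range(0, len(left) - frame_len + 1, frame_len):
        e_left = sum(x * x for x in left[start:start + frame_len])
        e_right = sum(x * x for x in right[start:start + frame_len])
        ilds.append(10.0 * math.log10((e_right + eps) / (e_left + eps)))
    return ilds


def infer_direction(ilds, threshold_db=1.0):
    """Classify motion from the ILD change between the first and last frame."""
    delta = ilds[-1] - ilds[0]
    if abs(delta) < threshold_db:  # arbitrary threshold, chosen for the sketch
        return "static"
    return "left_to_right" if delta > 0 else "right_to_left"


if __name__ == "__main__":
    left, right = synth_moving_tone(-60, 60)
    print(infer_direction(frame_ild_db(left, right)))
```

A rule this simple works here only because the synthetic clip is clean and carries a single cue. The benchmark's clips also vary in noise level and motion type, so this should be read as an illustration of the cue, not as a baseline for the task.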