OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam

TL;DR

This work tackles spatial reasoning in audio-language models by introducing SAGE, a geometry-aware encoder that leverages binaural cues fused with panoramic depth and RIR supervision during training, while inference uses only audio. Building on SAGE, OWL integrates a geometry-grounded encoder with a large language model to perform multi-step spatial reasoning and produce interpretable chain-of-thought rationales. The authors release BiDepth, a large synthetic dataset linking binaural audio, RIRs, depth images, and QA/CoT annotations to support geometry-aware training and evaluation. Empirically, SAGE improves DoA accuracy and localization robustness, and OWL achieves state-of-the-art performance on perceptual QA and spatial reasoning across SpatialSoundQA and BiDepth, demonstrating the value of geometry grounding and CoT supervision for audio-LLMs. The work also discusses limitations of simulation-based data and outlines directions toward real-world data, interactive dialogue, and richer multimodal grounding.

Abstract

Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room impulse responses at training time, while requiring only audio at inference. Building on this representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to rationalize over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves spatial reasoning QA accuracy by up to $\textbf{25\%}$ over BAT.
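To make the two quantities in the abstract concrete, the following is a minimal sketch of how an azimuth could be mapped to an o'clock label and how an angular DoA error between a predicted and a true direction could be computed. It is an illustration under stated assumptions, not the authors' implementation: the clock convention (12 o'clock straight ahead, hours increasing clockwise in 30-degree bins), the axis convention in the unit-vector conversion, and the function names `azimuth_to_oclock`, `doa_to_unit_vector`, and `angular_error_deg` are all hypothetical and do not come from the paper.

```python
import math


def azimuth_to_oclock(azimuth_deg: float) -> int:
    """Map an azimuth in degrees to a clock hour in 1..12.

    Assumed convention (not from the paper): 0 degrees is straight ahead
    (12 o'clock) and angles increase clockwise, so 90 degrees is 3 o'clock.
    Each hour covers a 30-degree bin centred on the hour mark.
    """
    wrapped = azimuth_deg % 360.0
    # floor(x + 0.5) rounds half up, avoiding Python's banker's rounding.
    hour = int(math.floor(wrapped / 30.0 + 0.5)) % 12
    return 12 if hour == 0 else hour


def doa_to_unit_vector(azimuth_deg: float, elevation_deg: float):
    """Convert (azimuth, elevation) in degrees to a 3D unit vector."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def angular_error_deg(pred, true) -> float:
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    u = doa_to_unit_vector(*pred)
    v = doa_to_unit_vector(*true)
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))


if __name__ == "__main__":
    print(azimuth_to_oclock(90.0))
    print(angular_error_deg((350.0, 0.0), (10.0, 0.0)))
```

Measuring the error as an angle between unit vectors, rather than as a difference of raw azimuths, handles wrap-around at 360 degrees and accounts for elevation; whether the reported mean DoA error is defined this way is not stated in this section.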

Paper Structure

This paper contains 30 sections, 17 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: SAGE encodes binaural audio into spatially grounded representations. OWL detects events, localizes by direction and distance, and applies multi-step reasoning, yielding interpretable rationales for queries such as "Which sound source is left of the listener?"
  • Figure 2: Example of paired modalities in BiDepth. Left: panoramic depth image $\mathbf{D_i}$ capturing geometric context from the listener's perspective. Right: binaural acoustic simulation, where a sound source $\mathbf{s(x, y, z, \theta)}$ is rendered at a position $\mathbf{s(x', y', z')}$ relative to the listener.
  • Figure 3: Azimuth and elevation angle distributions in BiDepth, showing source directions relative to the listener. Azimuths are nearly uniform, while elevations cluster near the horizontal plane.
  • Figure 4: Architecture of OWL and SAGE. The left panel shows SAGE, trained with geometry-aware supervision using RIRs and depth cues. The right panel illustrates the OWL pipeline, where the Binaural Audio Encoder $\mathbf{\phi_a(\cdot)}$ is combined with the LLM $\mathbf{\Pi}$ through a projector $\mathbf{\psi(\cdot)}$ to generate spatially grounded answers. Icons in the figure mark trainable and frozen components.
  • Figure 5: Distributions of azimuth, elevation, and source-receiver distance in BiDepth. Azimuth angles are nearly uniform, elevation is skewed toward the horizontal plane (reflecting typical indoor acoustics), and distances peak around 1.8 m within a 10 m range. The dataset, comprising 28K binaural RIRs and 1.1M QA pairs, will be made publicly available to ensure reproducibility and facilitate further research.
  • ...and 6 more figures