SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models

S Sakshi, Vaibhavi Lokegaonkar, Neil Zhang, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha, Lie Lu

TL;DR

The paper tackles the lack of spatial reasoning in large audio-language models by introducing SPUR, a plug-in spatial encoder that processes First-Order Ambisonics inputs to produce rotation-aware embeddings for existing LALMs. It presents SPUR-Set, a spatial QA benchmark combining real and simulated FOA scenes to train and evaluate six spatial reasoning skills. The approach keeps the base LALMs frozen, fine-tuning only SPUR components and employing LoRA to inject spatial bias. Empirically, SPUR improves spatial QA and multi-speaker attribution while preserving non-spatial performance, demonstrating a practical path to spatially grounded audio–language models.

Abstract

Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audio, most operate on monaural inputs and cannot capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.
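
To make the (W, X, Y, Z) representation concrete, the following is a minimal sketch of how a single mono source could be encoded into FOA channels, how a listener yaw rotation acts on those channels, and how a direction can be read back from channel correlations. It is illustrative only and is not the SPUR encoder. The channel gains assume an unnormalized first-order convention (W carries the signal, X/Y/Z carry direction cosines), and the function names `encode_foa`, `rotate_listener_yaw`, and `estimate_direction` are hypothetical; the paper's exact convention is not stated here.

```python
import numpy as np

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as FOA channels (W, X, Y, Z).

    Assumed convention: W = s, and X/Y/Z are s scaled by the
    direction cosines of the source relative to the listener.
    """
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    w = signal
    x = signal * np.cos(az) * np.cos(el)
    y = signal * np.sin(az) * np.cos(el)
    z = signal * np.sin(el)
    return np.stack([w, x, y, z])

def rotate_listener_yaw(foa, yaw_deg):
    """Re-express the scene after the listener turns by yaw_deg.

    Only X and Y mix; W (omnidirectional) and Z (vertical) are unchanged.
    """
    t = np.deg2rad(yaw_deg)
    w, x, y, z = foa
    x_rot = np.cos(t) * x + np.sin(t) * y
    y_rot = -np.sin(t) * x + np.cos(t) * y
    return np.stack([w, x_rot, y_rot, z])

def estimate_direction(foa):
    """Estimate (azimuth, elevation) in degrees from W-to-X/Y/Z correlations."""
    w, x, y, z = foa
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.rad2deg(np.arctan2(iy, ix))
    elevation = np.rad2deg(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation

# Usage: a noise source placed at 60 degrees azimuth, 20 degrees elevation.
rng = np.random.default_rng(0)
scene = encode_foa(rng.standard_normal(1000), 60.0, 20.0)
print(estimate_direction(scene))
print(estimate_direction(rotate_listener_yaw(scene, 30.0)))
```

Under this convention, turning the listener by some yaw angle shifts the apparent azimuth by the same amount and leaves elevation untouched, which is the kind of listener-centric, rotation-dependent cue a monaural input cannot carry.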

Paper Structure

This paper contains 19 sections, 11 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Illustration of our proposed SPUR approach for spatial LALMs. SPUR introduces spatial awareness into existing LALM encoders by converting multi-channel FOA inputs into geometry-aware embeddings. We first extract spatial covariance features through banded covariance computation, one-pole temporal smoothing, and real-valued vectorization. We then project these spatial features via convolution, patching, and transformer blocks into the audio encoder’s embedding space. The adapted spatial embeddings are then passed through a projector into the LLM. Only the SPUR-Encoder, MLP, and LoRA layers are fine-tuned, while the base audio encoder and LLM remain frozen.
  • Figure 2: Overview of the SPUR-Set curation pipeline and example tasks. The left panel illustrates the multi-stage pipeline used to construct SPUR-Set. Multi-channel FOA recordings are selected, transcribed with Whisper, captioned with an LALM, and paired with spatial metadata (sound events, elevation, distance). This information is then passed to a frontier text-only LLM to produce skill-oriented question–answer pairs. A subset of the outputs undergoes human verification for SPUR-Set-Test. The right panel presents representative examples across the six reasoning skill categories in SPUR-Set.
  • Figure 3: Azimuth and elevation angle distributions in the train set, displaying source directions relative to the listener.
  • Figure 4: Class-wise azimuth angle distribution.
  • Figure 5: Class-wise elevation angle distribution.
  • ...and 8 more figures
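
The feature-extraction steps named in the Figure 1 caption (banded covariance computation, one-pole temporal smoothing, and real-valued vectorization) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the STFT size, hop, number of bands, uniform band edges, smoothing coefficient `alpha`, and all function names are hypothetical choices made for the example.

```python
import numpy as np

def stft(foa, n_fft=512, hop=256):
    """Hann-windowed STFT per channel; returns (channels, frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (foa.shape[1] - n_fft) // hop
    frames = np.stack(
        [foa[:, i * hop:i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=-1)

def banded_covariance(spec, n_bands=8):
    """Average the channel outer product over the bins of each band.

    Returns complex Hermitian matrices of shape (frames, bands, C, C).
    Uniform band edges are an assumption made for simplicity.
    """
    n_ch, n_frames, n_bins = spec.shape
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)
    cov = np.empty((n_frames, n_bands, n_ch, n_ch), dtype=complex)
    for b in range(n_bands):
        band = spec[:, :, edges[b]:edges[b + 1]]
        cov[:, b] = np.einsum("ctf,dtf->tcd", band, band.conj()) / band.shape[-1]
    return cov

def one_pole_smooth(cov, alpha=0.9):
    """Recursive smoothing over time: R[t] = alpha * R[t-1] + (1 - alpha) * C[t]."""
    out = np.empty_like(cov)
    state = cov[0]
    for t in range(cov.shape[0]):
        state = alpha * state + (1 - alpha) * cov[t]
        out[t] = state
    return out

def vectorize_real(cov):
    """Flatten each Hermitian matrix into real values.

    Keeps the real diagonal plus the real and imaginary parts of the
    upper triangle, giving C*C values per band (16 for four FOA channels).
    """
    rows, cols = np.triu_indices(cov.shape[-1], k=1)
    diag = np.real(np.diagonal(cov, axis1=-2, axis2=-1))
    upper = cov[..., rows, cols]
    return np.concatenate([diag, upper.real, upper.imag], axis=-1)

# Usage on a random four-channel signal standing in for an FOA recording.
rng = np.random.default_rng(0)
foa = rng.standard_normal((4, 4096))
features = vectorize_real(one_pole_smooth(banded_covariance(stft(foa))))
print(features.shape)
```

In a pipeline like the one the caption describes, a per-frame, per-band feature tensor of this kind would then be handed to the convolution, patching, and transformer stages; those learned stages are outside the scope of this sketch.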