
Sci-Phi: A Large Language Model Spatial Audio Descriptor

Xilin Jiang, Hannes Gamper, Sebastian Braun

TL;DR

Sci-Phi presents the first spatial-audio large language model capable of full spatial-scene description by coupling a spatial encoder with an audio encoder to generate structured scene metadata for multiple sources and room acoustics. Trained on over 4,000 hours of synthetic first-order Ambisonics data, Sci-Phi generalizes to real room impulse responses with modest degradation and is evaluated via a permutation-invariant protocol across 15 metrics. The approach advances beyond single-source or mono-channel LLMs, enabling comprehensive What/Where/When descriptions and robust performance under varying acoustic conditions, with scalability to denser scenes and a spatial Q&A extension. This work has strong implications for hearing assistive devices, robotics, and spatial environment annotation, bringing audio foundation models closer to real-world, open-ended spatial reasoning.

Abstract

Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo
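The abstract does not spell out how the permutation-invariant protocol pairs predicted sources with reference sources, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes equal numbers of predicted and reference sources, uses the angular error between directions as the matching cost, and searches all permutations by brute force, which stays cheap for up to four sources (4! = 24 orderings). The names `angular_error_deg` and `best_assignment` are hypothetical.

```python
from itertools import permutations
import math


def angular_error_deg(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs given in degrees."""
    az1, el1 = (math.radians(v) for v in a)
    az2, el2 = (math.radians(v) for v in b)
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def best_assignment(predicted, reference, cost):
    """Return (total_cost, pairs) for the lowest-cost one-to-one matching.

    pairs is a list of (predicted_index, reference_index) tuples.
    Assumption for this sketch: both lists have the same length.
    """
    if len(predicted) != len(reference):
        raise ValueError("this sketch assumes equal source counts")
    best_total, best_pairs = math.inf, []
    # Try every ordering of the reference sources and keep the cheapest one.
    for perm in permutations(range(len(reference))):
        total = sum(cost(predicted[i], reference[j]) for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    return best_total, best_pairs
```

Calling `best_assignment(predicted, reference, angular_error_deg)` with the same sources listed in a different order yields a total cost of zero, which is the property that makes the resulting metrics independent of the order in which the model enumerates sources. A real protocol would also need a rule for mismatched source counts and could combine several attributes in the cost; both are left out here.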

Paper Structure

This paper contains 10 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Sci-Phi architecture, derived from Phi-4-Multimodal (visual components not shown for clarity). Fire and snowflake mark the trainable and frozen components. Light red, blue, and grey colors correspond to spatial, spectral, and textual features, modules, embeddings, and computation flow.
  • Figure 2: Spatial audio analysis results of the synthetic-RIR (solid) and real-RIR (empty) test sets. Each subplot is one evaluation metric. Green/red indicates higher/lower is better. Note: room volume error and noise CLAP are missing for the real-RIR test set because FOA-MEIR lacks ground truth for them.
  • Figure 3: Confusion matrices of single-source localization from the synthetic-RIR (S1) and real-RIR (R1) test sets. Note that we only show a shorter confusion matrix R1, because all source elevations of FOA-MEIR are within $[-22.5^{\circ}, 22.5^{\circ}]$ (horizontal label by elevation thresholds), although Sci-Phi was trained for and may predict all elevations (horizontal, upper, or lower).
  • Figure 4: Confusion matrices of source counting from the synthetic-RIR (S2) and real-RIR (R2) test sets.
  • Figure 5: Environmental robustness of Sci-Phi on the real-RIR test set. A--B: expected behavior across decreasing SNR and increasing reverberation; C--E: consistent performance even when sources are acoustically, spatially, or temporally close; F: reliable recognition of brief ($\sim 1$ s) sources, with longer durations providing added context. (The marker position and the shaded area correspond to mean $\pm$ std in D--F.)
  • ...and 1 more figures