Sci-Phi: A Large Language Model Spatial Audio Descriptor
Xilin Jiang, Hannes Gamper, Sebastian Braun
TL;DR
Sci-Phi is the first spatial-audio large language model capable of full spatial-scene description. It couples a spatial encoder with an audio encoder to generate structured scene metadata for multiple sources and the room acoustics. Trained on over 4,000 hours of synthetic first-order Ambisonics data, Sci-Phi generalizes to real room impulse responses with only minor degradation and is evaluated with a permutation-invariant protocol across 15 metrics. The approach goes beyond single-source or mono-channel LLMs: it produces comprehensive What/Where/When descriptions, remains robust under varying acoustic conditions, scales to denser scenes, and extends to spatial Q&A. The work has strong implications for assistive hearing devices, robotics, and spatial environment annotation, bringing audio foundation models closer to real-world, open-ended spatial reasoning.
Abstract
Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel at sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Trained on over 4,000 hours of synthetic first-order Ambisonics recordings with accompanying metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo
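
The permutation-invariant protocol mentioned in the abstract can be illustrated with a minimal sketch. Because the model lists sources in no fixed order, predictions have to be matched to reference sources before per-source metrics are computed. The abstract does not specify the matching cost or algorithm, so everything below is an assumption for illustration: the function names `angular_error_deg` and `best_assignment` are hypothetical, the cost is the angular error between directions of arrival, and the matching is a brute-force search, which is affordable for at most four directional sources.

```python
import math
from itertools import permutations


def angular_error_deg(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs in degrees."""
    az1, el1 = map(math.radians, a)
    az2, el2 = map(math.radians, b)
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def best_assignment(predicted, reference, cost=angular_error_deg):
    """Return (total_cost, pairs) for the lowest-cost one-to-one matching.

    pairs holds (predicted_index, reference_index) tuples. This is an
    illustrative brute-force search, not the paper's stated procedure.
    """
    # Permute the longer list so every item in the shorter list gets a partner.
    swap = len(predicted) > len(reference)
    short, long_ = (reference, predicted) if swap else (predicted, reference)
    best_cost, best_pairs = math.inf, []
    for perm in permutations(range(len(long_)), len(short)):
        total = sum(cost(short[i], long_[j]) for i, j in enumerate(perm))
        if total < best_cost:
            best_cost = total
            best_pairs = [(j, i) if swap else (i, j) for i, j in enumerate(perm)]
    return best_cost, best_pairs


# Hypothetical example: the same two sources, listed in opposite order.
predicted = [(90.0, 0.0), (0.0, 0.0)]
reference = [(0.0, 0.0), (90.0, 0.0)]
total_cost, pairs = best_assignment(predicted, reference)
```

In this sketch, per-source metrics would then be computed on the matched pairs only. How sources left unmatched are scored when the predicted and reference counts differ is not stated in the abstract, so the sketch simply leaves them out of `pairs`.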
