GRAM: Spatial general-purpose audio representation models for real-world applications
Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden
TL;DR
GRAM is a General-purpose Real-world Audio Model that learns spatial audio representations with a multi-channel masked auto-encoder trained on simulated real-world scenes. It addresses a gap in existing audio foundation models by incorporating spatial cues, reverberation, and noise. The authors also release Nat-HEAR, a benchmark for evaluating audio models, including on spatial tasks, under naturalistic conditions. GRAM achieves state-of-the-art performance on HEAR and Nat-HEAR with substantially less training data and delivers state-of-the-art sound localization, surpassing even some supervised methods, while supporting both binaural and Ambisonics inputs. The model also transfers robustly to real-world recordings, laying the groundwork for spatially aware audio foundation models in real-world applications.
Abstract
Although audio foundation models have seen great progress on a wide variety of tasks, their application in real-world acoustic environments with reverberation and noise has been less successful. Moreover, as audio foundation models are typically trained on dry, single-channel audio clips, the inherent spatial nature of real-world sound scenes is overlooked and tasks involving sound localization are ruled out. To address these limitations, we propose GRAM: a General-purpose Real-world Audio Model utilizing a multi-channel masked auto-encoder approach to efficiently learn spatial audio representations from high-quality simulated real-world scenes. To evaluate the performance of GRAM and other audio foundation models in real-world sound scenes, we release Nat-HEAR: a naturalistic version of the HEAR benchmark suite comprising a simulated real-world version of the HEAR tasks as well as two new sound localization tasks. We show that the performance of GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR, while using only a fraction of the training data. GRAM also achieves state-of-the-art localization performance, surpassing even supervised sound localization approaches, and can be flexibly applied either to a two-channel binaural format or to a four-channel Ambisonics format. Validating GRAM's performance on real-world sound recordings demonstrates robust transfer to real-world scenes. Taken together, GRAM represents a significant advancement towards robust, spatial audio foundation models for real-world applications.
