GRAM: Spatial general-purpose audio representation models for real-world applications

Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden

TL;DR

GRAM is a General-purpose Real-world Audio Model that learns spatial audio representations with a multi-channel masked auto-encoder trained on simulated real-world scenes. It addresses a gap in existing audio foundation models by incorporating spatial cues, reverberation, and noise, and the authors release Nat-HEAR to evaluate models under naturalistic conditions, including two new sound localization tasks. GRAM achieves state-of-the-art performance on HEAR and Nat-HEAR while using only a fraction of the training data, and it delivers state-of-the-art sound localization, even surpassing some supervised methods, while supporting both binaural and Ambisonics inputs. The work also demonstrates robust transfer to real-world recordings and lays the groundwork for spatially aware audio foundation models in real-world applications.

Abstract

Although audio foundation models have seen great progress on a wide variety of tasks, their application in real-world acoustic environments with reverberation and noise has been less successful. Moreover, as audio foundation models are typically trained on dry, single-channel audio clips, the inherent spatial nature of real-world sound scenes is overlooked and tasks involving sound localization are ruled out. To address these limitations, we propose GRAM: a General-purpose Real-world Audio Model utilizing a multi-channel masked auto-encoder approach to efficiently learn spatial audio representations from high-quality simulated real-world scenes. To evaluate the performance of GRAM and other audio foundation models in real-world sound scenes, we release Nat-HEAR: a naturalistic version of the HEAR benchmark suite comprising a simulated real-world version, as well as two new sound localization tasks. We show that the performance of GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR, while using only a fraction of the training data. GRAM also showcases state-of-the-art localization performance, surpassing even supervised sound localization approaches, and can be flexibly applied either to a two-channel, binaural sound format or a four-channel, Ambisonics format. Validating GRAM's performance on real-world sound recordings demonstrates robust transfer to real-world scenes. Taken together, GRAM presents a significant advancement towards robust, spatial audio foundation models for real-world applications.

Paper Structure

This paper contains 19 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Proposed self-supervised approach for training GRAMs on binaural spectrograms. The Patch Extraction layer patches and embeds the input spectrogram using 2D convolution. A random subset of patches is masked (ratio = 0.8). Unmasked patches are fed to the encoder. The decoder takes the encoder outputs padded with the masked patches and reconstructs the original spectrogram. For Ambisonics spectrograms, the methodology stays the same except that the inputs now contain 4-channel mel spectrograms and intensity vectors (IVs).
  • Figure 2: Downstream model performance on naturalistic sound scenes. (A) Nat-HEAR and HEAR downstream performance as a function of training data quantity. (B) Box plots of the difference in performance on HEAR and Nat-HEAR, excluding the DCASE-2016 task. Box limits reflect the first and third quartile, center line the median (see also Table \ref{tab:dry_audio_benchmarks}).
  • Figure 3: Localizing sound sources in simulated real-world sound scenes. (A) Box plots of direction of arrival (DoA) error. Box limits: first and third quartile; center line: median; whiskers: 1.5 times the interquartile range. (B) Confusion matrices for GRAM-Ambisonics on the SC-5 localization task. Azimuth: 0° = forward, +90° = right. Elevation: -90° = down, +90° = up.
  • Figure 4: Ablation studies. Effect of hyperparameters on HEAR and Nat-HEAR performance. From left to right: (1) GRAM-Binaural downstream performance as a function of the ratio $\lambda$ between clean and naturalistic scenes in the pre-training data. (2) GRAM-Ambisonics downstream performance as a function of the ratio $\lambda$ between clean and naturalistic scenes in the pre-training data. (3) Effect of masking strategy on downstream performance for GRAM-Binaural. (4) Comparison of Mamba and Transformer architectures on binaural training data. Note that the architectures depicted in (4) were trained with a reduced batch size (96 $\rightarrow$ 32).
  • Figure 5: Additional ablation studies. Effect of hyperparameters on HEAR and Nat-HEAR performance. From left to right: (1) GRAM-Binaural downstream performance as a function of the number of in-batch samples. (2) Effect of the masking ratio for GRAM-Binaural. Note that the GRAM-Binaural model depicted in (2) was trained with a reduced number of samples (16 $\rightarrow$ 4).
  • ...and 1 more figure
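
The masking step described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`patchify`, `random_patch_mask`, `restore_sequence`), the spectrogram size, and the patch size are hypothetical, and raw flattened patches stand in for the learned 2D-convolution patch embedding. Only the two-channel binaural input and the masking ratio of 0.8 come from the caption.

```python
import numpy as np


def patchify(spec, patch_f, patch_t):
    """Split a (channels, freq, time) spectrogram into flattened,
    non-overlapping patches of shape (num_patches, channels * patch_f * patch_t)."""
    c, f, t = spec.shape
    assert f % patch_f == 0 and t % patch_t == 0, "patch size must divide the input"
    x = spec.reshape(c, f // patch_f, patch_f, t // patch_t, patch_t)
    x = x.transpose(1, 3, 0, 2, 4)  # (n_freq_patches, n_time_patches, c, patch_f, patch_t)
    return x.reshape(-1, c * patch_f * patch_t)


def random_patch_mask(num_patches, mask_ratio=0.8, rng=None):
    """Pick a random subset of patches to keep; the rest are masked."""
    rng = np.random.default_rng() if rng is None else rng
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])


def restore_sequence(visible, keep_idx, num_patches, mask_token):
    """Build the decoder input: visible entries at their original positions,
    a shared placeholder (mask token) everywhere else."""
    full = np.tile(mask_token, (num_patches, 1))
    full[keep_idx] = visible
    return full


rng = np.random.default_rng(0)
# Hypothetical binaural input: 2 channels, 128 mel bins, 200 time frames.
spec = rng.standard_normal((2, 128, 200))
patches = patchify(spec, patch_f=16, patch_t=8)       # hypothetical patch size
keep_idx, mask_idx = random_patch_mask(len(patches), mask_ratio=0.8, rng=rng)

visible = patches[keep_idx]                           # what the encoder would see
mask_token = np.zeros(patches.shape[1])               # learned in a real model
decoder_in = restore_sequence(visible, keep_idx, len(patches), mask_token)

print(patches.shape, visible.shape, decoder_in.shape)  # (200, 256) (40, 256) (200, 256)
```

With a masking ratio of 0.8, the encoder in this sketch processes only 40 of the 200 patches, which is the usual reason masked auto-encoders are cheap to pre-train. In a real model the encoder would transform `visible` before it is placed back into the full sequence, and the reconstruction loss would compare the decoder output against the original patches.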