Blind Estimation of Sub-band Acoustic Parameters from Ambisonics Recordings using Spectro-Spatial Covariance Features

Hanyu Meng, Jeroen Breebaart, Jeremy Stoddard, Vidhyasaharan Sethu, Eliathamby Ambikairajah

TL;DR

The paper tackles blind, frequency-dependent estimation of room acoustic parameters from first-order Ambisonics (FOA) recordings, proposing the Spectro-Spatial Covariance Vector (SSCV) to encode temporal, spectral, and inter-channel information. A novel FOA-Conv3D back-end applies 3D convolutions over SSCV features to jointly capture time, frequency, and spatial cues, achieving lower estimation errors and a higher proportion of variance explained than single-channel approaches across 10 frequency bands for $T_{60}$, DRR, and $C_{50}$. Evaluations on Spatial Librispeech-Lite show significant improvements from spatial information and recurrent architectures, establishing a new state of the art for FOA-based blind estimation of frequency-varying acoustic parameters. The work enables more faithful, dynamic spatial audio rendering for VR/AR by providing robust multi-band estimates from FOA inputs, and it suggests extending the approach to geometry and source orientation in future work.

Abstract

Estimating frequency-varying acoustic parameters is essential for enhancing immersive perception in realistic spatial audio creation. In this paper, we propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR), and clarity (C50) across 10 frequency bands using first-order Ambisonics (FOA) speech recordings as inputs. The proposed framework utilizes a novel feature named Spectro-Spatial Covariance Vector (SSCV), efficiently representing temporal, spectral as well as spatial information of the FOA signal. Our models significantly outperform existing single-channel methods with only spectral information, reducing estimation errors by more than half for all three acoustic parameters. Additionally, we introduce FOA-Conv3D, a novel back-end network for effectively utilising the SSCV feature with a 3D convolutional encoder. FOA-Conv3D outperforms the convolutional neural network (CNN) and recurrent convolutional neural network (CRNN) backends, achieving lower estimation errors and accounting for a higher proportion of variance (PoV) for all 3 acoustic parameters.
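The abstract does not spell out how the SSCV is computed, so the following is only a rough illustration of what an inter-channel covariance feature over time and frequency can look like; it is not the paper's definition. The function name `spatial_covariance_features`, the equal-width band grouping, and the packing of each Hermitian covariance matrix into a real vector are all assumptions made for this sketch.

```python
import numpy as np

def spatial_covariance_features(stft, n_bands=10):
    """Illustrative inter-channel covariance features (NOT the paper's SSCV).

    stft: complex array of shape (channels, frames, bins), e.g. a 4-channel
          FOA short-time Fourier transform.
    Returns a real array of shape (frames, n_bands, channels * channels).
    """
    n_ch, n_frames, n_bins = stft.shape
    # Assumption: equal-width groups of STFT bins stand in for frequency bands.
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)
    upper = np.triu_indices(n_ch, k=1)  # indices of the strictly upper triangle
    feats = np.empty((n_frames, n_bands, n_ch * n_ch))
    for b in range(n_bands):
        band = stft[:, :, edges[b]:edges[b + 1]]  # (channels, frames, bins_in_band)
        # Per-frame channel covariance, averaged over the bins in this band.
        cov = np.einsum("ctf,dtf->tcd", band, band.conj()) / band.shape[2]
        diag = np.real(np.diagonal(cov, axis1=1, axis2=2))  # channel powers
        off = cov[:, upper[0], upper[1]]                    # cross-channel terms
        # Pack the Hermitian matrix into a real vector of length channels**2.
        feats[:, b, :] = np.concatenate([diag, off.real, off.imag], axis=1)
    return feats
```

For a 4-channel FOA input, each frame and band yields 16 real values: 4 channel powers plus the real and imaginary parts of the 6 cross-channel terms. A tensor laid out as time by band by channel-pair is the kind of input over which a 3D convolution can operate.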

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures.

Figures (4)

  • Figure 1: The problem context of this paper
  • Figure 2: The process of SSCV feature extraction
  • Figure 3: An overview of the FOA-based network structures applied in this paper (K: kernel size, D: dilation size)
  • Figure 4: Model performance metrics (MAE $\downarrow$, PoV $\uparrow$, PCC $\uparrow$; ordinate) for all three acoustic parameter estimation tasks (DRR, T60, C50) as a function of frequency (abscissa). Different curves represent different models (see legend).
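The three metrics named in the Figure 4 caption can be sketched per frequency band as follows. This listing does not define PoV, so the sketch assumes it is the coefficient of determination (one minus the ratio of residual to total sum of squares); the paper's definition may differ. The function name `band_metrics` and the samples-by-bands array layout are also assumptions.

```python
import numpy as np

def band_metrics(y_true, y_pred):
    """Per-band MAE, PoV, and PCC for arrays of shape (samples, bands).

    PoV is ASSUMED here to be the coefficient of determination (R^2);
    the source listing does not give its definition.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    # Mean absolute error per band (lower is better).
    mae = np.mean(np.abs(err), axis=0)

    # Assumed PoV: 1 - residual sum of squares / total sum of squares.
    true_centered = y_true - y_true.mean(axis=0)
    pov = 1.0 - np.sum(err ** 2, axis=0) / np.sum(true_centered ** 2, axis=0)

    # Pearson correlation coefficient per band (higher is better).
    pred_centered = y_pred - y_pred.mean(axis=0)
    pcc = np.sum(true_centered * pred_centered, axis=0) / np.sqrt(
        np.sum(true_centered ** 2, axis=0) * np.sum(pred_centered ** 2, axis=0)
    )
    return mae, pov, pcc
```

Under this assumed definition, a constant bias in the predictions leaves PCC at 1 while lowering PoV, which is one reason to report the two alongside MAE.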