BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Sheng Kuang, Jie Shi, Kiki van der Heijden, Siamak Mehrkanoon

TL;DR

This work introduces BAST, a convolution-free, end-to-end Transformer model for binaural sound localization that processes left and right binaural spectrograms through dual encoders and a center encoder to predict 2D azimuth. It explores shared versus non-shared encoder weights and three interaural integration methods, optimizing with $AD$, $MSE$, or a hybrid loss to maximize angular accuracy in both anechoic and reverberant environments. BAST-NSP with subtraction integration and hybrid loss achieves the best performance ($AD$ ≈ 1.29°, $MSE$ ≈ 0.001), outperforming CNN- and ViT-based baselines and displaying symmetry across left/right hemifields and robust generalization across environments. Attention rollout analyses provide interpretable insights into the localization process, validating the model’s bilateral processing and integration mechanism. The results demonstrate the feasibility and effectiveness of binaural Transformers for real-world sound localization, with code and data publicly available.
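As a rough illustration of the three objectives named above, the sketch below assumes the network outputs a 2D point $(x, y)$ whose direction encodes the azimuth. The function names, the degree-to-coordinate mapping, and the unweighted sum used for the hybrid loss are assumptions made for this example, not details confirmed by this summary.

```python
import numpy as np

def azimuth_to_xy(deg):
    """Map azimuth in degrees to a point on the unit circle (assumed encoding)."""
    rad = np.radians(np.asarray(deg, dtype=float))
    return np.stack([np.cos(rad), np.sin(rad)], axis=-1)

def angular_distance(pred, target):
    """Unsigned angle in radians between 2D vectors, one value per row."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    dot = pred[..., 0] * target[..., 0] + pred[..., 1] * target[..., 1]
    cross = pred[..., 0] * target[..., 1] - pred[..., 1] * target[..., 0]
    # atan2(cross, dot) is the signed angle; its magnitude is the angular error.
    return np.abs(np.arctan2(cross, dot))

def mse(pred, target):
    """Mean squared error over all predicted coordinates."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def hybrid_loss(pred, target):
    """Assumed hybrid objective: MSE plus mean angular distance, unweighted."""
    return mse(pred, target) + float(np.mean(angular_distance(pred, target)))

pred = azimuth_to_xy([10.0, 350.0])
target = azimuth_to_xy([0.0, 10.0])
print(np.degrees(angular_distance(pred, target)))  # per-sample angular error in degrees
print(mse(pred, target), hybrid_loss(pred, target))
```

Computing the angle with `arctan2` rather than `arccos` keeps the error well behaved when the prediction and target nearly coincide, and it handles the wrap-around between 350° and 10° without special cases.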

Abstract

Accurate sound localization in reverberant environments is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been used to model the binaural human auditory pathway. However, CNNs have difficulty capturing global acoustic features. To address this issue, we propose a novel end-to-end Binaural Audio Spectrogram Transformer (BAST) model to predict the sound azimuth in both anechoic and reverberant environments. Two implementation modes are explored: BAST-SP and BAST-NSP, corresponding to the BAST model with shared and non-shared parameters, respectively. Our model with subtraction interaural integration and hybrid loss achieves an angular distance of 1.29 degrees and a Mean Square Error of 1e-3 at all azimuths, significantly surpassing the CNN-based model. An exploratory analysis of BAST's performance in the left and right hemifields and in anechoic and reverberant environments shows its generalization ability as well as the feasibility of binaural Transformers for sound localization. Furthermore, an analysis of the attention maps provides additional insight into the localization process in a natural reverberant environment.
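The two design choices described above, shared versus non-shared encoder parameters and the interaural integration step, can be pictured with a toy example. Here `toy_encoder` is a single linear map standing in for a full Transformer encoder, and both the feature-axis concatenation and the left-minus-right ordering are assumptions made for illustration; this summary does not specify them.

```python
import numpy as np

def toy_encoder(patches, weights):
    """Hypothetical stand-in for a Transformer encoder: linear map plus tanh."""
    return np.tanh(patches @ weights)

def integrate(left, right, method="subtraction"):
    """Combine left and right embeddings of shape (num_patches, emb_dim)."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    if left.shape != right.shape:
        raise ValueError("left and right embeddings must have the same shape")
    if method == "concatenation":
        return np.concatenate([left, right], axis=-1)  # assumed: feature axis
    if method == "addition":
        return left + right
    if method == "subtraction":
        return left - right  # assumed order: left minus right
    raise ValueError(f"unknown integration method: {method}")

rng = np.random.default_rng(0)
num_patches, in_dim, emb_dim = 4, 6, 3  # small hypothetical sizes
left_patches = rng.normal(size=(num_patches, in_dim))
right_patches = rng.normal(size=(num_patches, in_dim))

# Shared parameters (SP): one weight matrix encodes both ears.
w_shared = rng.normal(size=(in_dim, emb_dim))
sp_left = toy_encoder(left_patches, w_shared)
sp_right = toy_encoder(right_patches, w_shared)

# Non-shared parameters (NSP): each ear has its own weight matrix.
w_left = rng.normal(size=(in_dim, emb_dim))
w_right = rng.normal(size=(in_dim, emb_dim))
nsp_left = toy_encoder(left_patches, w_left)
nsp_right = toy_encoder(right_patches, w_right)

for method in ("concatenation", "addition", "subtraction"):
    fused = integrate(nsp_left, nsp_right, method)
    print(method, fused.shape)
```

One property worth noting in this toy setting: with shared parameters and identical left and right inputs, subtraction yields an all-zero output, so whatever the fused representation carries comes from differences between the two ears.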

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Architecture of the proposed Binaural Audio Spectrogram Transformer (BAST). (a) The architecture of the proposed model. (Here there are $N_{H}N_{T}$ number of patches). (b) The architecture of a single Transformer encoder. (c) Three examined interaural integration methods: concatenation, addition and subtraction.
  • Figure 2: The angular distance (AD) error of the proposed BAST in each azimuth with different loss functions and interaural integration methods.
  • Figure 3: The AD error of the proposed BAST-NSP and BAST-SP in the left and right hemifield. The boxplot indicates quartiles of the metric distribution with respect to azimuths. The asterisk between two boxes indicates a statistically significant difference ($p < 0.05$, paired t-test with FDR correction) between the left and right hemifield.
  • Figure 4: The MSE of the proposed BAST-NSP and BAST-SP in the left and right hemifield. The boxplot indicates quartiles of the metric distribution with respect to azimuths. The asterisk between two boxes indicates a statistically significant difference ($p < 0.05$, paired t-test with FDR correction) between the left and right hemifield.
  • Figure 5: An example of the attention matrices in the proposed model (i.e., BAST-NSP with hybrid loss and subtraction integration). The corresponding sound clip was randomly selected from the category of human speech with reverberation. For each layer, we present the patch-to-patch attention matrix (size: $180 \times 180$) computed with the rollout method of Chefer et al. (2021). Note that we initialize the attention matrix at the first layer of TE-C by summing the attention matrices at the last layer of TE-L and TE-R.
  • ...and 1 more figure
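
The rollout procedure mentioned in the Figure 5 caption can be sketched as follows. This is the plain attention-rollout recipe (average the heads, mix in the identity for the residual connection, renormalize, and multiply layer by layer); the cited method may differ in detail, and the random attention weights, layer counts, and helper names here are hypothetical. The only element taken from the caption is that the TE-C rollout starts from the sum of the final TE-L and TE-R matrices.

```python
import numpy as np

def random_attention(rng, num_layers, num_heads, n):
    """Hypothetical attention weights: softmax over random logits, per layer."""
    layers = []
    for _ in range(num_layers):
        logits = rng.normal(size=(num_heads, n, n))
        weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
        layers.append(weights / weights.sum(axis=-1, keepdims=True))
    return layers

def rollout(attentions, init=None):
    """Return the accumulated patch-to-patch matrix after each layer.

    attentions: list of (num_heads, n, n) arrays, ordered first to last layer.
    init: optional (n, n) starting matrix; defaults to the identity.
    """
    n = attentions[0].shape[-1]
    joint = np.eye(n) if init is None else np.asarray(init, dtype=float)
    per_layer = []
    for attn in attentions:
        a = attn.mean(axis=0)                  # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)          # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows normalized
        joint = a @ joint                      # accumulate across layers
        per_layer.append(joint)
    return per_layer

rng = np.random.default_rng(0)
n_patches = 6  # small hypothetical size; the caption reports 180 x 180 matrices
left = rollout(random_attention(rng, 3, 2, n_patches))
right = rollout(random_attention(rng, 3, 2, n_patches))

# TE-C starts from the sum of the last-layer TE-L and TE-R matrices.
center = rollout(random_attention(rng, 3, 2, n_patches), init=left[-1] + right[-1])
print(len(center), center[-1].shape)
```

Because each per-layer matrix is row-normalized, the rows of the TE-L and TE-R rollouts sum to 1, and the rows of the TE-C rollout sum to 2 under this initialization; rescaling before plotting is a presentation choice left open here.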