Deep learning based spatial aliasing reduction in beamforming for audio capture

Mateusz Guzik, Giulio Cengarle, Daniel Arteaga

TL;DR

Spatial aliasing limits beamforming performance in spaced microphone arrays by introducing direction-dependent distortions at higher frequencies. The authors propose a U-Net that predicts a multichannel de-aliasing filter $\mathbf{F}_{ft}$ to apply to virtual microphone signals, with a fixed decoder $\mathbf{D}$ to yield alias-free outputs. Supervision uses an alias-free encoder $\mathbf{E}$ and the PHASEN loss to align the de-aliased output with the target. Two experimental setups, stereo via cardioid pairs and 2D first-order Ambisonics (FOA) decoding, show substantial objective (C-Si-SNR) and subjective (MUSHRA) improvements over conventional beamforming, with both fixed and varying microphone spacings. The results indicate that cross-channel interactions help in some setups, and the framework remains adaptable to different spatial audio pipelines and decoders; reverberation and frequency-dependent polar patterns are named as future extensions. Overall, the work demonstrates that deep learning can effectively mitigate spatial aliasing in beamforming, enabling higher-fidelity audio capture in multi-microphone configurations.
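
To give a sense of where "higher frequencies" begins, the sketch below uses the standard half-wavelength criterion for a two-microphone pair, $f_{\text{alias}} = c / (2d)$. This is a textbook rule of thumb, not a formula taken from the paper; the function name, the speed of sound, and the example spacings are all assumptions made for illustration.

```python
# Assumed speed of sound in dry air at roughly 20 degrees Celsius.
SPEED_OF_SOUND_M_S = 343.0


def aliasing_onset_hz(spacing_m, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Frequency above which a two-microphone pair can alias spatially.

    Half-wavelength criterion: aliasing can occur once the spacing
    exceeds half the wavelength, i.e. for f > c / (2 * d).
    """
    if spacing_m <= 0:
        raise ValueError("spacing_m must be positive")
    return speed_of_sound / (2.0 * spacing_m)


if __name__ == "__main__":
    # Hypothetical spacings, not the ones used in the paper's experiments.
    for spacing in (0.02, 0.05, 0.10):
        print(f"d = {spacing:.2f} m -> onset near {aliasing_onset_hz(spacing):.0f} Hz")
```

Under these assumptions a 10 cm spacing starts to alias at roughly 1.7 kHz, so most of the audible band above that point is affected.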

Abstract

Spatial aliasing affects spaced microphone arrays, causing directional ambiguity above certain frequencies, degrading spatial and spectral accuracy of beamformers. Given the limitations of conventional signal processing and the scarcity of deep learning approaches to spatial aliasing mitigation, we propose a novel approach using a U-Net architecture to predict a signal-dependent de-aliasing filter, which reduces aliasing in conventional beamforming for spatial capture. Two types of multichannel filters are considered, one which treats the channels independently and a second one that models cross-channel dependencies. The proposed approach is evaluated in two common spatial capture scenarios: stereo and first-order Ambisonics. The results indicate a very significant improvement, both objective and perceptual, with respect to conventional beamforming. This work shows the potential of deep learning to reduce aliasing in beamforming, leading to improvements in multi-microphone setups.
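
The two filter types mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tensor shapes, the function names, and the choice to apply the filter before a fixed decoding matrix are assumptions made for illustration. In the paper the filter is predicted by the U-Net; here it is simply passed in as an array.

```python
import numpy as np


def apply_independent_filter(X, F):
    """Channel-independent filter: one complex gain per channel and TF bin.

    X: (C, n_freq, n_frames) complex STFT of the input channels.
    F: (C, n_freq, n_frames) complex filter (assumed shape).
    """
    return F * X


def apply_cross_channel_filter(X, F):
    """Cross-channel filter: a C x C complex matrix per TF bin.

    F: (C, C, n_freq, n_frames); output channel i mixes all input channels j.
    """
    return np.einsum("ijft,jft->ift", F, X)


def decode(D, Y):
    """Apply a fixed decoding matrix D of shape (n_out, C) to Y of shape (C, n_freq, n_frames)."""
    return np.einsum("oc,cft->oft", D, Y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, n_freq, n_frames = 2, 4, 3
    X = rng.standard_normal((C, n_freq, n_frames)) + 1j * rng.standard_normal((C, n_freq, n_frames))
    F_ind = rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape)

    # A cross-channel filter with zero off-diagonal terms reduces to the independent case.
    F_cross = np.zeros((C, C, n_freq, n_frames), dtype=complex)
    for c in range(C):
        F_cross[c, c] = F_ind[c]

    same = np.allclose(apply_independent_filter(X, F_ind), apply_cross_channel_filter(X, F_cross))
    print("diagonal cross-channel filter matches independent filter:", same)
```

The independent filter is the special case of the cross-channel filter in which every per-bin matrix is diagonal, so the cross-channel variant has strictly more freedom at the cost of predicting $C^2$ rather than $C$ values per time-frequency bin.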

Paper Structure

This paper contains 10 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Polar representation of spatially aliased conventional beamforming. The curves are in arbitrary logarithmic scale.
  • Figure 2: Spatial responses of the left- and right-facing cardioids of experiment i) fix, respectively in the top and bottom row, in four frequency bands.
  • Figure 3: Spatial responses of the decoded front- and left-facing cardioids of experiment ii) fix, respectively in the top and bottom row, in four frequency bands.
  • Figure 4: Box plots for MUSHRA scores across different conditions and tests. Each point identifies the average rating for all test excerpts for one given listener.