Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications

Dimme de Groot, Baturalp Karslioglu, Odette Scharenborg, Jorge Martinez

TL;DR

The paper addresses ASR robustness in voice-driven applications (VDAs) when the loudspeakers themselves are the dominant source of interference. It introduces loudspeaker spotforming (LSp), a near-field beamforming approach that shapes the playback signals to create a low-energy zone around the VDA while limiting perceptual distortion through an auditory-masking-based constraint. The problem is formulated as a convex optimization and solved with CVX. The method relies on a region-based spatial covariance model that uses a probability distribution $p_\mathcal{M}$ to describe likely microphone positions and optionally includes late reverberation through $R_\text{iso}$. Experiments in simulations and real rooms show significant energy reduction near the VDA with minimal degradation in listener quality, and clear improvements in ASR performance (WER/WIL) under adverse SIRs. A key contribution is the demonstrated trade-off, controlled by the distortion parameter $d$, between energy reduction around the VDA and audio quality at the listener's location. Computational complexity remains a bottleneck for real-time operation.

Abstract

In this paper we propose a robust loudspeaker beamforming algorithm which is used to enhance the performance of voice driven applications in scenarios where the loudspeakers introduce the majority of the noise, e.g. when music is playing loudly. The loudspeaker beamformer modifies the loudspeaker playback signals to create a low-acoustic-energy region around the device that implements automatic speech recognition for a voice driven application (VDA). The algorithm utilises a distortion measure based on human auditory perception to limit the distortion perceived by human listeners. Simulations and real-world experiments show that the proposed loudspeaker beamformer improves the speech recognition performance in all tested scenarios. Moreover, the algorithm allows the acoustic energy around the VDA device to be reduced further at the expense of reduced objective audio quality at the listener's location.
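
The structure of the optimization described above can be sketched as follows. This is only an illustrative form inferred from the summary, not the paper's exact formulation: the filter vector $\mathbf{w}$, the transfer-function vector $\mathbf{h}(\mathbf{x})$, the per-control-point distortion measure $D^{(p)}$, and the number of control points $P$ are assumed notation, and $R_\text{iso}$ is assumed to enter additively. Only $p_\mathcal{M}$, $R_\text{iso}$, $\mathcal{M}$ and $d$ appear in the source.

```latex
% Assumed notation: the pressure at position x is h(x)^H w, so the expected
% energy over likely microphone positions is the quadratic form w^H R_M w.
\begin{aligned}
\mathbf{R}_\mathcal{M} &= \int_{\mathcal{M}} p_\mathcal{M}(\mathbf{x})\,
    \mathbf{h}(\mathbf{x})\,\mathbf{h}(\mathbf{x})^{H}\,\mathrm{d}\mathbf{x}
    \;+\; R_\text{iso}, \\
\min_{\mathbf{w}} \quad & \mathbf{w}^{H}\,\mathbf{R}_\mathcal{M}\,\mathbf{w} \\
\text{s.t.} \quad & D^{(p)}(\mathbf{w}) \le d, \qquad p = 1,\dots,P .
\end{aligned}
```

Because $\mathbf{R}_\mathcal{M}$ is positive semidefinite, the objective is convex. If each $D^{(p)}$ is also convex in $\mathbf{w}$, the whole problem is convex, and a larger $d$ enlarges the feasible set, which is consistent with the reported trade-off between energy reduction and audio quality.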

Paper Structure (10 sections, 12 equations, 4 figures)

This paper contains 10 sections, 12 equations, 4 figures.

Figures (4)

  • Figure 1: The loudspeaker spotformer (LSp) setup. In (a), a top-view schematic of the setup is shown. The LSp computes the loudspeaker playback signals which minimise the acoustic energy in region $\mathcal{M}$ around the microphones ($\bullet$) of the voice driven application (VDA). The control points ($\times$) are not physically placed but are modelled within the region where the user is expected to be listening. The algorithm limits the acoustic distortion at these points. In (b), a photo of the actual experiment setup is shown. A zoomed-in photo of our VDA implementation using a circular microphone array is shown in the top-right corner. In the top-left corner, a zoomed-in photo of the loudspeaker emulating the user in the experiments is shown. The microphones on the pink grid are used to evaluate the audio quality in Sec. 4a.
  • Figure 2: A schematic view of the setup used in both simulations and real-world experiments. Zoom-ins on the user region (including control points $\mathbf{x}_\text{P}^{(p)}$ and the user location $\mathbf{x}_\text{u}$) and on the VDA are provided. The microphone array of the VDA has a radius $\mu_r=0.1$ cm and a height $\mu_z=0.99$ m. Region $\mathcal{M}$ is placed approximately on top of these microphones and is centered at the VDA location $\mathbf{x}_\text{M}$. The loudspeakers are located at positions $\mathbf{x}_\text{L}^{(l)}$.
  • Figure 3: The results of the objective speech quality metric (including zoom-in) at the validation points, given as a mean opinion score (MOS) (a), and the reduction in received energy at the microphones (b), both as a function of the distortion parameter $d$. In (a) the presented results are averages over the different test signals and validation points. In (b) the results of the simulations are averaged over the 100 runs and the eight microphones of the array. The results of the real-world scenario are averaged over the eight microphones of the array.
  • Figure 4: The results for word error rate (WER, upper row) and word information lost (WIL, lower row) as a function of SIR in simulated reverberant conditions (a) and in the real room (b), where lower is better, for the three microphone beamformers (indicated by the different colours), with (dashed line) and without (solid line) the loudspeaker spotformer (LSp). The shown results are averages over the different test signals and voice commands. The high WER is due to the interfering speech signals; outliers were removed.
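
For readers unfamiliar with the two ASR metrics in Figure 4, the following is a minimal sketch of how WER and WIL are commonly computed from a word-level alignment. It is not the paper's evaluation code: the function names `align_counts` and `wer_wil` are hypothetical, the example sentences are made up, and ties between equally short alignments are broken in favour of more hits, which is an assumption that can affect WIL.

```python
def align_counts(reference, hypothesis):
    """Count hits, substitutions, deletions and insertions for a
    minimum-edit-distance word alignment (ties prefer more hits)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (distance, -hits, subs, dels, ins) for ref[:i] vs hyp[:j];
    # tuples compare lexicographically, so min() picks the smallest distance
    # first and the largest hit count second.
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            d, h, s, dl, ins = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (d, h - 1, s, dl, ins)        # hit
            else:
                diag = (d + 1, h, s + 1, dl, ins)    # substitution
            d, h, s, dl, ins = prev[j]
            delete = (d + 1, h, s, dl + 1, ins)      # reference word dropped
            d, h, s, dl, ins = cur[j - 1]
            insert = (d + 1, h, s, dl, ins + 1)      # extra hypothesis word
            cur.append(min(diag, delete, insert))
        prev = cur
    _, neg_hits, subs, dels, ins = prev[-1]
    return -neg_hits, subs, dels, ins


def wer_wil(reference, hypothesis):
    """Return (WER, WIL) for one reference/hypothesis pair."""
    hits, subs, dels, ins = align_counts(reference, hypothesis)
    n_ref = hits + subs + dels   # words in the reference
    n_hyp = hits + subs + ins    # words in the hypothesis
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    wer = (subs + dels + ins) / n_ref
    # WIL = 1 - (H / N_ref) * (H / N_hyp); an empty hypothesis loses everything.
    wil = 1.0 if n_hyp == 0 else 1.0 - (hits * hits) / (n_ref * n_hyp)
    return wer, wil


if __name__ == "__main__":
    # Made-up command: one deletion ("the") and one substitution ("down" -> "town").
    print(wer_wil("turn the music down", "turn music town"))
```

WER can exceed 1 when the recogniser inserts many words, which is one way interfering speech can inflate it, whereas WIL stays between 0 and 1 and is lower-is-better like WER.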