Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications
Dimme de Groot, Baturalp Karslioglu, Odette Scharenborg, Jorge Martinez
TL;DR
The paper tackles the robustness of automatic speech recognition (ASR) in voice-driven applications (VDAs) when the loudspeakers themselves are the dominant source of interference. It introduces loudspeaker spotforming (LSp), a near-field beamforming approach that shapes the playback signals to create a low-energy zone around the VDA device while limiting perceptual distortion through an auditory-masking-based constraint. The problem is formulated as a convex optimization and solved with CVX. The method relies on a region-based spatial covariance model that uses a probability distribution $p_\mathcal{M}$ to describe likely microphone positions and can optionally include late reverberation through $R_\text{iso}$. Simulations and real-room experiments show reduced acoustic energy near the VDA device with little degradation in quality at the listener's position, and improved ASR performance (WER/WIL) at adverse signal-to-interference ratios (SIRs). A key contribution is the trade-off controlled by the distortion parameter d, which exchanges energy reduction at the VDA device against audio quality at the listener. Computational complexity remains a bottleneck for real-time operation.
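For readers unfamiliar with the two ASR metrics named above, the following is a minimal sketch of how word error rate (WER) and word information lost (WIL) are commonly computed from a word-level alignment. It is not the authors' evaluation code: the function names `align_counts` and `wer_wil` are hypothetical, and the tie-breaking rule (prefer the minimum-edit alignment with the most matching words) is an assumption made here so that the hit count is well defined.

```python
def align_counts(ref_words, hyp_words):
    """Return (hits, substitutions, deletions, insertions) for a
    minimum-edit-distance alignment of two word lists.

    Assumption: among equally cheap alignments, the one with the most
    hits is chosen.
    """
    n, m = len(ref_words), len(hyp_words)
    # table[i][j] = (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, 0, i, 0)  # delete every reference word
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, 0, j)  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, h, s, d, k = table[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diag = (c, h + 1, s, d, k)      # hit
            else:
                diag = (c + 1, h, s + 1, d, k)  # substitution
            c, h, s, d, k = table[i - 1][j]
            dele = (c + 1, h, s, d + 1, k)      # deletion
            c, h, s, d, k = table[i][j - 1]
            inse = (c + 1, h, s, d, k + 1)      # insertion
            # Lowest cost first, then most hits.
            table[i][j] = min(diag, dele, inse, key=lambda t: (t[0], -t[1]))
    return table[n][m][1:]


def wer_wil(reference, hypothesis):
    """Compute (WER, WIL) for two whitespace-tokenised transcripts."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    hits, subs, dels, ins = align_counts(ref_words, hyp_words)
    wer = (subs + dels + ins) / len(ref_words)
    if not hyp_words:
        wil = 1.0  # nothing recognised, so all word information is lost
    else:
        wil = 1.0 - (hits / len(ref_words)) * (hits / len(hyp_words))
    return wer, wil


if __name__ == "__main__":
    print(wer_wil("turn the music down", "turn music down please"))
```

In this example the alignment has three hits, one deletion ("the") and one insertion ("please"). WER can exceed 1 when the hypothesis contains many insertions, whereas WIL stays between 0 and 1.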
Abstract
In this paper, we propose a robust loudspeaker beamforming algorithm that enhances the performance of voice driven applications in scenarios where the loudspeakers introduce the majority of the noise, e.g. when music is playing loudly. The loudspeaker beamformer modifies the loudspeaker playback signals to create a low-acoustic-energy region around the device that implements automatic speech recognition for a voice driven application (VDA). The algorithm uses a distortion measure based on human auditory perception to limit the distortion perceived by human listeners. Simulations and real-world experiments show that the proposed loudspeaker beamformer improves the speech recognition performance in all tested scenarios. Moreover, the algorithm allows the acoustic energy around the VDA device to be reduced further, at the expense of reduced objective audio quality at the listener's location.
