A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model
Dongdi Zhao, Jianbo Ma, Lu Lu, Jinke Li, Xuan Ji, Lei Zhu, Fuming Fang, Ming Liu, Feijun Jiang
TL;DR
The paper addresses far-field speech recognition by proposing a unified framework that jointly optimizes neural beamforming and a transformer-based LAS end-to-end model. It introduces a neural beamforming block that performs spatial filtering across $P$ look directions and spectral filtering via factored Complex Linear Projection, combined with pooling strategies and optional direction-of-arrival (DOA) priors. The model is trained end-to-end to optimize recognition performance, and two DOA-aware mechanisms are proposed to further exploit source direction information. Experiments on two large in-house datasets show a relative improvement of about 19% over a strong baseline. Projection pooling and DOA integration provide the best gains, and the system is robust to array spacing and DOA errors, which matters for multichannel far-field ASR in real-world scenarios.
Abstract
Far-field speech recognition is a challenging task that conventionally relies on signal-processing beamforming to combat noise and interference. However, the performance of this approach is usually limited by its heavy reliance on assumptions about the environment. In this paper, we propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) speech recognition system, extending the end-to-end speech recognition system to include speech enhancement. The framework is jointly trained to optimize the final objective of interest. Specifically, factored complex linear projection (fCLP) is adopted to form the neural beamforming block. Several pooling strategies for combining look directions are compared to find the optimal approach. Moreover, source direction information is integrated into the beamforming to explore its usefulness as a prior, which is often available, especially in multi-modality scenarios. Experiments on different microphone array geometries are conducted to evaluate robustness to variation in microphone spacing. Large in-house databases are used to evaluate the effectiveness of the proposed framework, and the proposed method achieves a 19.26\% relative improvement over a strong baseline.
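As a rough illustration of the beamforming front end described above, the following NumPy sketch applies frequency-domain spatial filters for $P$ look directions, a complex linear projection along frequency, log-magnitude compression, and pooling over look directions. It is a minimal sketch under stated assumptions, not the authors' implementation. The function name `neural_beamform_features`, the tensor shapes, the log-magnitude compression, and the three pooling variants (in particular the weighted-sum reading of "projection" pooling) are all hypothetical choices. In a trained system `W`, `G`, and `proj` would be learned jointly with the recognizer.

```python
import numpy as np

def neural_beamform_features(X, W, G, pooling="max", proj=None, eps=1e-6):
    """Hypothetical factored front end: spatial filter, spectral filter, pool.

    X: (C, T, F) complex STFT of C microphone channels.
    W: (P, C, F) complex spatial filters, one set per look direction (assumed shape).
    G: (N, F) complex spectral projection, shared across look directions (assumed).
    proj: (P,) real weights, only used when pooling == "projection".
    Returns a real array of shape (T, N).
    """
    # Spatial filtering: filter-and-sum over channels for each look direction.
    Y = np.einsum("pcf,ctf->ptf", W, X)          # (P, T, F)
    # Spectral filtering: complex linear projection along the frequency axis.
    Z = np.einsum("nf,ptf->ptn", G, Y)           # (P, T, N)
    # Log-magnitude compression (an assumed nonlinearity).
    feats = np.log(np.abs(Z) + eps)              # (P, T, N)

    # Combine the P look directions into a single feature stream.
    if pooling == "max":
        return feats.max(axis=0)
    if pooling == "mean":
        return feats.mean(axis=0)
    if pooling == "projection":
        if proj is None:
            raise ValueError("projection pooling needs a weight vector of length P")
        return np.einsum("p,ptn->tn", proj, feats)
    raise ValueError(f"unknown pooling strategy: {pooling}")
```

With randomly initialized filters the output is meaningless as a feature, but the shapes show how a `(C, T, F)` multichannel input collapses to a single `(T, N)` stream that an attention-based encoder could consume.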
