A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Dongdi Zhao; Jianbo Ma; Lu Lu; Jinke Li; Xuan Ji; Lei Zhu; Fuming Fang; Ming Liu; Feijun Jiang

TL;DR

The paper addresses far-field speech recognition by proposing a unified framework that jointly optimizes neural beamforming and a transformer-based LAS end-to-end model. It introduces a neural beamforming block that performs spatial filtering across $P$ look directions and spectral filtering via factored Complex Linear Projection, combined with pooling strategies and optional direction-of-arrival (DOA) priors to enhance beamforming. The model is trained end-to-end to optimize recognition performance, and two DOA-aware mechanisms are proposed to further exploit source direction information. Experiments on two large in-house datasets show a relative improvement of about 19% over a strong baseline. Projection pooling and DOA integration give the largest gains, and the system is robust to array spacing and DOA errors, which matters for multichannel far-field ASR in real-world scenarios.

Abstract

Far-field speech recognition is a challenging task that conventionally relies on signal-processing beamforming to combat noise and interference. However, its performance is usually limited by a heavy reliance on assumptions about the environment. In this paper, we propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) speech recognition system, extending the end-to-end speech recognition system to include speech enhancement. The framework is jointly trained to optimize the final objective of interest. Specifically, factored complex linear projection (fCLP) is adopted to form the neural beamforming. Several pooling strategies for combining look directions are compared to find the optimal approach. Moreover, source direction information is integrated into the beamforming to explore its usefulness as a prior, which is often available, especially in multi-modality scenarios. Experiments on different microphone array geometries are conducted to evaluate robustness to variation in microphone spacing. Large in-house databases are used to evaluate the effectiveness of the proposed framework, and the proposed method achieves a 19.26% improvement over a strong baseline.
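To make the data flow described in the abstract concrete, the following is a minimal sketch of a factored front end: spatial filtering over several look directions, a complex linear projection over frequency, and pooling across look directions. The tensor layouts, the function names (`spatial_filter`, `spectral_filter`, `max_pool_directions`), the log-magnitude compression, and the random weights are all assumptions made for illustration, not the authors' implementation. Max pooling is used only as one simple stand-in for the pooling strategies the paper compares. In the actual system these weights would be learned jointly with the recognizer.

```python
import math
import random


def spatial_filter(X, H):
    """Apply P look-direction filters to a C-channel STFT.

    X[c][t][f]: complex STFT of channel c (assumed input layout).
    H[p][c][f]: complex weight for look direction p (random here, learned in practice).
    Returns Y[p][t][f] = sum_c H[p][c][f] * X[c][t][f].
    """
    C, T, F = len(X), len(X[0]), len(X[0][0])
    return [[[sum(H[p][c][f] * X[c][t][f] for c in range(C))
              for f in range(F)]
             for t in range(T)]
            for p in range(len(H))]


def spectral_filter(Y, G, eps=1e-6):
    """Complex linear projection over frequency, then log-magnitude.

    G[g][f]: complex projection weights shared by all look directions.
    Returns Z[p][t][g] = log(|sum_f G[g][f] * Y[p][t][f]| + eps).
    """
    return [[[math.log(abs(sum(G[g][f] * frame[f] for f in range(len(frame)))) + eps)
              for g in range(len(G))]
             for frame in Yp]
            for Yp in Y]


def max_pool_directions(Z):
    """Collapse the look-direction axis with an element-wise maximum."""
    P, T, G_dim = len(Z), len(Z[0]), len(Z[0][0])
    return [[max(Z[p][t][g] for p in range(P)) for g in range(G_dim)]
            for t in range(T)]


def random_complex(rng):
    """Draw a complex number with independent Gaussian real and imaginary parts."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))


# Toy sizes (hypothetical): 2 channels, 3 frames, 4 frequency bins,
# 5 look directions, 6 output features per frame.
rng = random.Random(0)
C, T, F, P, G_DIM = 2, 3, 4, 5, 6
X = [[[random_complex(rng) for _ in range(F)] for _ in range(T)] for _ in range(C)]
H = [[[random_complex(rng) for _ in range(F)] for _ in range(C)] for _ in range(P)]
G = [[random_complex(rng) for _ in range(F)] for _ in range(G_DIM)]

features = max_pool_directions(spectral_filter(spatial_filter(X, H), G))
print(len(features), len(features[0]))  # frames, features per frame
```

The pooled `features` have one vector per frame and no look-direction axis, which is the form an attention-based encoder would consume. Swapping `max_pool_directions` for another reduction over the look-direction axis is where the alternative pooling strategies would plug in.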

Paper Structure (14 sections, 11 equations, 5 figures, 3 tables)

Introduction
CTC/attention system
Proposed framework
Neural Beamforming
Neural Beamforming with ASR
Integrating source direction
Experiments
Databases
System Configurations
Results
Performance on different microphone array spacing
Performance of integrating source direction
Performance against DOA resolution
Conclusion

Figures (5)

Figure 1: Attention-based End-to-End Architecture.
Figure 2: Architecture of proposed unified neural beamforming end-to-end speech recognition system.
Figure 3: Illustration of different pooling methods.
Figure 4: Illustration of integrating source direction.
Figure 5: WER varies with DOA error rate.
