RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios
Yiwen Shao, Shi-Xiong Zhang, Dong Yu
TL;DR
The paper addresses target-speech recognition in multi-channel, multi-speaker settings under reverberation, where traditional 3D spatial features degrade because they model only the direct wave and ignore reflections. It introduces RIR-SF, a spatial feature obtained by convolving the mixture with the target speaker's RIR and measuring phase differences via RP. A window parameter k controls how much of the reverberation is taken into account, and a neural RIR Conv Block is added for feature extraction. The authors report that RIR-SF outperforms 3D spatial features, remains robust under high reverberation, and yields a relative CER reduction of $21.3\%$ in multi-channel scenarios. The all-neural architecture combines RIR-SF, an optional ADL-RNNBF, and an RNNT-Conformer backbone. Experiments on AISHELL-1–based synthetic data with varied RT60 show improved performance, reduced sensitivity to reverberation, and resilience to RIR estimation errors, suggesting applicability to ASR, speech separation, and enhancement in reverberant environments.
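To make the idea of an RIR-driven phase feature concrete, the following is a minimal sketch, not the authors' formulation. It assumes the classical cross-relation identity: if channel i observes the target through RIR h_i, then convolving channel 0 with h_c and channel c with h_0 gives the same signal wherever the target dominates, so the cosine of their phase difference is close to 1 there. The function names (`stft`, `simulate`, `rir_spatial_feature`), the `ref`, `n_fft`, and `hop` parameters, and the reading of k as the number of leading RIR taps kept are all assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Minimal Hann-windowed STFT; returns a (frames, n_fft // 2 + 1) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def simulate(source, rir):
    """Hypothetical helper: render a single source through per-channel RIRs."""
    return np.stack([np.convolve(source, h) for h in rir])

def rir_spatial_feature(mixture, rir, k, ref=0, n_fft=256, hop=128, eps=1e-8):
    """Illustrative RIR-based spatial feature (an assumption, not the paper's formula).

    mixture: (channels, samples) multi-channel recording.
    rir:     (channels, taps) RIR of the target speaker for each microphone.
    k:       number of leading RIR taps kept (assumed meaning of the window k).
    Returns a (frames, bins) map in [-1, 1]; values near 1 suggest target-dominated bins.
    """
    h = rir[:, :k]  # truncate the RIR to its first k taps
    feats = []
    for c in range(mixture.shape[0]):
        if c == ref:
            continue
        # Cross-relation: y_ref * h_c equals y_c * h_ref when only the target is active.
        a = stft(np.convolve(mixture[ref], h[c]), n_fft, hop)
        b = stft(np.convolve(mixture[c], h[ref]), n_fft, hop)
        # Cosine of the phase difference between the two cross-convolved signals.
        feats.append(np.real(a * np.conj(b)) / (np.abs(a) * np.abs(b) + eps))
    return np.mean(feats, axis=0)  # average over microphone pairs

# Toy usage with synthetic, exponentially decaying random RIRs (hypothetical data).
rng = np.random.default_rng(0)
decay = np.exp(-np.arange(64) / 16.0)
h_target = rng.standard_normal((2, 64)) * decay
h_other = rng.standard_normal((2, 64)) * decay
target_only = simulate(rng.standard_normal(4000), h_target)
other_only = simulate(rng.standard_normal(4000), h_other)
feat_target = rir_spatial_feature(target_only, h_target, k=64)
feat_other = rir_spatial_feature(other_only, h_target, k=64)
```

In this toy setup the feature is computed with the target's RIR in both cases, so it should be high for the target-only recording and noticeably lower for the interferer. Choosing a smaller k makes the identity only approximate, which is one way a window parameter could trade off how much reverberation is modeled.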
Abstract
Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that leverages the speaker's position, room acoustics, and reflection dynamics. RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance. We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative $21.3\%$ reduction in CER for target speaker ASR in multi-channel settings. RIR-SF enhances recognition accuracy and demonstrates robustness in high-reverberation scenarios, overcoming the limitations of previous methods.
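For readers unfamiliar with the metric, a relative CER reduction is conventionally measured against the baseline system's CER. The symbols below are generic notation introduced here for illustration, not taken from the paper:

$$\Delta_{\mathrm{rel}} = \frac{\mathrm{CER}_{\mathrm{base}} - \mathrm{CER}_{\mathrm{new}}}{\mathrm{CER}_{\mathrm{base}}}$$

Under this definition, $\Delta_{\mathrm{rel}} = 21.3\%$ means $\mathrm{CER}_{\mathrm{new}} = (1 - 0.213)\,\mathrm{CER}_{\mathrm{base}} = 0.787\,\mathrm{CER}_{\mathrm{base}}$.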
