RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

Yiwen Shao; Shi-Xiong Zhang; Dong Yu

TL;DR

The paper addresses target-speech recognition in multi-channel, multi-speaker settings under reverberation, where traditional 3D spatial features falter because they ignore reflection waves. It introduces RIR-SF, a spatial feature obtained by convolving the mixture with the target speaker's room impulse response (RIR) and measuring phase differences via RP. A window parameter $k$ controls how much of the reverberation is taken into account, and a neural RIR Conv Block performs the extraction. The authors report that RIR-SF outperforms 3D spatial features, remains robust under high reverberation, and yields a relative character error rate (CER) reduction of $21.3\%$ in multi-channel scenarios. The all-neural architecture combines RIR-SF, an optional ADL-RNNBF, and an RNNT-Conformer backbone. Experiments on AISHELL-1–based synthetic data with varied RT60 show improved performance, reduced sensitivity to reverberation, and resilience to RIR estimation errors, suggesting broad applicability to ASR, speech separation, and enhancement in reverberant environments.

Abstract

Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that leverages the speaker's position, room acoustics, and reflection dynamics. RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance. We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3\% reduction in CER for target speaker ASR in multi-channel settings. RIR-SF enhances recognition accuracy and demonstrates robustness in high-reverberation scenarios, overcoming the limitations of previous methods.
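
For reference, a relative CER reduction is conventionally measured against the baseline system's CER. The symbols $\mathrm{CER}_{\text{base}}$ and $\mathrm{CER}_{\text{new}}$ below are illustrative names, not notation from the paper, and the abstract does not state the underlying CER values:

$$
\text{relative CER reduction} = \frac{\mathrm{CER}_{\text{base}} - \mathrm{CER}_{\text{new}}}{\mathrm{CER}_{\text{base}}} \times 100\%
$$

Under this convention, a 21.3\% relative reduction means the new CER is $0.787$ times the baseline CER, whatever the baseline happens to be.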

Paper Structure (14 sections, 14 equations, 3 figures, 2 tables)

Introduction
Problem with 3D Spatial Feature
Signal Model
3D Spatial Feature from MTF/CTF Perspective
RIR Based Spatial Feature
Why RIR-SF is better
System
RIR Conv Block
ADL-RNNBF
RNNT-Conformer
Experiments
Dataset
Experimental Results Analysis
Conclusions

Figures (3)

Figure 1: An illustration of 3D spatial features with (a) weak reverberation and (b) strong reverberation. In (a), $\mathbb{SF}_1$ closely matches the pattern of the Log Power Spectrum (LPS) of the target speaker $X_1$. In (b), however, $\mathbb{SF}_1$ fails to identify the target source from the mixture.
Figure 2: An illustration of $\mathbb{RSF}_i(t;k)$ and $\lVert C_i^{m}(t;k)\rVert$ for the example in Figure 1 (b) with different $k$.
Figure 3: (a) RIR Conv Block that utilizes convolution layers to extract RIR-based spatial feature $\mathbb{RSF}_i(t;k)$; (b) The whole Multi-channel Multi-speaker ASR system that combines $\mathbb{RSF}_i(t;k)$ and ADL-RNNBF.
