Deep Filter Estimation from Inter-Frame Correlations for Monaural Speech Dereverberation

Ui-Hyeop Shin; Jun Hyung Kim; Jangyeon Kim; Wooseok Kim; Hyung-Min Park

Abstract

Speech dereverberation in distant-microphone scenarios remains challenging due to the high correlation between reverberation and target signals, often leading to poor generalization in real-world environments. We propose IF-CorrNet, a correlation-to-filter architecture designed for robustness against acoustic variability. Unlike conventional black-box mapping methods that directly estimate complex spectra, IF-CorrNet explicitly exploits inter-frame STFT correlations to estimate multi-frame deep filters for each time-frequency bin. By shifting the learning objective from direct mapping to filter estimation, the network effectively constrains the solution space, which simplifies the training process and mitigates overfitting to synthetic data. Experimental results on the REVERB Challenge dataset demonstrate that IF-CorrNet achieves a substantial gain in the SRMR metric on RealData, confirming its robustness in suppressing reverberation and noise in practical, non-synthetic environments.
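The two operations the abstract names can be illustrated with a minimal NumPy sketch: inter-frame correlation features for each time-frequency bin, and a multi-frame deep filter that combines the current and past STFT frames of that bin. The sketch makes several assumptions. It takes a complex STFT `X` of shape (frames, bins) and a causal filter over the current and L-1 past frames with no frequency taps. It defines the correlation as a plain per-bin lagged product with no normalization or smoothing, and applies the filter with a conjugate convention. The paper's exact feature definition and network are not described in this section, so `inter_frame_correlation`, `apply_deep_filter`, `oracle_taps`, and the toy single-echo model are hypothetical illustrations, not IF-CorrNet itself. In IF-CorrNet a network estimates the filter; here an oracle filter for the toy model stands in for that estimate.

```python
import numpy as np


def past_frames(X: np.ndarray, L: int) -> np.ndarray:
    """Return P with P[t, f, l] = X[t - l, f] for l = 0..L-1 (zero before t = 0).

    X is a complex STFT of shape (T, F); the result has shape (T, F, L).
    """
    T, F = X.shape
    padded = np.concatenate([np.zeros((L - 1, F), dtype=X.dtype), X], axis=0)
    # padded[k] holds X[k - (L - 1)], so this slice lines up X[t - l] with row t.
    return np.stack([padded[L - 1 - l : L - 1 - l + T] for l in range(L)], axis=-1)


def inter_frame_correlation(X: np.ndarray, L: int) -> np.ndarray:
    """Hypothetical feature: R[t, f, l] = X[t, f] * conj(X[t - l, f])."""
    return X[..., None] * np.conj(past_frames(X, L))


def to_real_features(R: np.ndarray) -> np.ndarray:
    """Stack real and imaginary parts so a real-valued network can consume them."""
    return np.concatenate([R.real, R.imag], axis=-1)


def apply_deep_filter(X: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Multi-frame deep filter: Y[t, f] = sum_l conj(H[t, f, l]) * X[t - l, f]."""
    return np.sum(np.conj(H) * past_frames(X, H.shape[-1]), axis=-1)


def oracle_taps(a: float, L: int) -> np.ndarray:
    """Truncated inverse of the toy echo X[t] = S[t] + a * S[t - 1] (|a| < 1)."""
    return (-a) ** np.arange(L)


# Toy "reverberant" STFT (assumption): each bin gets one echo with gain a, one frame late.
rng = np.random.default_rng(0)
T, F, a = 200, 65, 0.6
S = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
X = S.copy()
X[1:] += a * S[:-1]

# Features a network could consume (L = 5 taps -> 10 real channels per bin).
features = to_real_features(inter_frame_correlation(X, 5))
print(features.shape)  # (200, 65, 10)

# Oracle taps stand in for the network's per-bin filter estimate.
errors = []
for L in (1, 2, 4, 8):
    H = np.broadcast_to(oracle_taps(a, L), (T, F, L))
    Y = apply_deep_filter(X, H)
    errors.append(np.linalg.norm(Y - S))
print([round(e / np.linalg.norm(S), 4) for e in errors])
```

For this toy echo the oracle taps (-a)^l telescope, leaving a residual of exactly -(-a)^L S[t-L], so the error shrinks geometrically as taps are added. Real reverberation is far longer and is not exactly invertible with a short filter, which is why the number of taps L is a design parameter worth studying, as Figure 2 does on real data.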

Paper Structure (15 sections, 4 equations, 3 figures, 4 tables)

Introduction
IF-CorrNet for Speech Dereverberation
  Inter-frame correlations for deep filter estimation
  Time-frequency module
  Transformer block with ConvFFN module
Experimental Setups
  Datasets and evaluation
  Training and model configuration
Experimental Results
  Investigation on the number of taps
  Comparison with existing baselines
  Efficiency and distance robustness analysis
  Impact of Inter-frame correlations on robustness
Conclusion
Generative AI Use Disclosure

Figures (3)

Figure 1: Overall architecture of IF-CorrNet. Inter-frame correlations are processed through the input layer and the frequency and time modules to estimate multi-frame filters.
Figure 2: (a) PESQ and (b) SRMR results on the REVERB Challenge dataset as a function of the number of taps $L$.
Figure 3: Spectrograms of a sample RealData utterance from the REVERB Challenge dataset: (a) input, and outputs of (b) IF-Corr + MF-Filter, (c) SF-Raw + MF-Filter, and (d) SF-Raw + SF-Mask.