Dereverberation Using Binary Residual Masking with Time-Domain Consistency

Daniel G. Williams

TL;DR

The paper tackles real-time vocal dereverberation with a residual binary mask in the STFT domain, predicted by a U-Net and optimized with a hybrid loss that enforces both mask accuracy and time-domain coherence. The method targets late reverberation while keeping the direct speech intact, using a signal model that defines a residual mask $M(f,t)$ and a dereverberated magnitude via $\hat{C}(f,t) = R(f,t) - \hat{\Delta}(f,t)$. Key contributions are the joint optimization of a binary mask, magnitude refinement, and time-domain consistency, plus post-processing that further suppresses residual reverb with spectral gating and simple EQ. Total latency is 9 ms, low enough for live applications. The results show competitive dereverberation at approximately 95× real-time speed, at the cost of some spectral detail, which supports practical use in real-time speech and singing scenarios. Future work aims to preserve timbre more faithfully while keeping the single-mask approach.

Abstract

Vocal dereverberation remains a challenging task in audio processing, particularly for real-time applications where both accuracy and efficiency are crucial. Traditional deep learning approaches often struggle to suppress reverberation without degrading vocal clarity, while recent methods that jointly predict magnitude and phase have significant computational cost. We propose a real-time dereverberation framework based on residual mask prediction in the short-time Fourier transform (STFT) domain. A U-Net architecture is trained to estimate a residual reverberation mask that suppresses late reflections while preserving direct speech components. A hybrid objective combining binary cross-entropy, residual magnitude reconstruction, and time-domain consistency further encourages both accurate suppression and perceptual quality. Together, these components enable low-latency dereverberation suitable for real-world speech and singing applications.
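
The residual-mask subtraction and the three-term objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the residual estimate is taken to be the binarized mask times the reverberant magnitude, the magnitude and time-domain terms use an L1 distance, the loss weights are arbitrary, and the function names `apply_residual_mask` and `hybrid_loss` are hypothetical.

```python
import numpy as np


def apply_residual_mask(reverb_mag, mask_prob, threshold=0.5):
    """Estimate the dereverberated magnitude C_hat = R - Delta_hat.

    Assumption (not stated in the source): Delta_hat = M * R, where M is
    the mask binarized at `threshold`, so masked bins are treated as reverb.
    """
    mask = (mask_prob >= threshold).astype(reverb_mag.dtype)
    residual_hat = mask * reverb_mag
    # Magnitudes cannot be negative, so clip as a safeguard.
    clean_mag = np.clip(reverb_mag - residual_hat, 0.0, None)
    return clean_mag, mask


def hybrid_loss(mask_prob, mask_target,
                residual_hat, residual_target,
                wave_hat, wave_target,
                weights=(1.0, 1.0, 1.0), eps=1e-7):
    """Weighted sum of BCE, residual-magnitude L1, and time-domain L1.

    The L1 distances and the default weights are illustrative assumptions.
    Waveforms are expected to be already reconstructed (e.g. by an inverse STFT).
    """
    p = np.clip(mask_prob, eps, 1.0 - eps)  # avoid log(0)
    bce = -np.mean(mask_target * np.log(p)
                   + (1.0 - mask_target) * np.log(1.0 - p))
    mag = np.mean(np.abs(residual_hat - residual_target))
    time = np.mean(np.abs(wave_hat - wave_target))
    w_bce, w_mag, w_time = weights
    return w_bce * bce + w_mag * mag + w_time * time


if __name__ == "__main__":
    # Toy 2x2 "spectrogram": rows are frequency bins, columns are frames.
    R = np.array([[1.0, 2.0], [3.0, 4.0]])
    probs = np.array([[0.9, 0.1], [0.2, 0.8]])
    C_hat, M = apply_residual_mask(R, probs)
    print(C_hat)
```

In this sketch the mask is hard-thresholded only at inference; training on the probabilities through the binary cross-entropy term keeps the objective differentiable in a real framework.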

Paper Structure

This paper contains 10 sections, 4 equations, 1 figure.

Figures (1)

  • Figure 1: Comparison of clean, predicted, and reverberant spectrograms.