Mel-McNet: A Mel-Scale Framework for Online Multichannel Speech Enhancement

Yujie Yang, Bing Yang, Xiaofei Li

TL;DR

This work tackles online multichannel speech enhancement under real-time computation constraints by introducing Mel-McNet, a Mel-scale framework in which an STFT-to-Mel module compresses multichannel STFT features before a Mel-domain McNet backbone processes spectral and spatial cues and produces enhanced LogMel spectra. These spectra can be fed directly to a vocoder or an ASR system. The learning target is a Mel-scale power ratio mask (Mel-PRM), trained with a mean-squared error loss and defined as $\text{Mel-PRM} = \min\left( \sqrt{ \dfrac{S^{\mathrm{Mel}}_{r}(t,f^{\prime})}{X^{\mathrm{Mel}}_{r}(t,f^{\prime})} }, 1 \right)$. Empirically, Mel-McNet reduces FLOPs by about 60% while maintaining speech enhancement and ASR performance comparable to the original McNet, and it surpasses other state-of-the-art methods on CHiME-3 in DNSMOS and WER, albeit with slight trade-offs in WB-PESQ and STOI due to vocoder effects. The results demonstrate the feasibility and benefits of Mel-scale multichannel processing for efficient, real-time deployment; future work aims to generalize the framework to other lightweight backbones.
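The Mel-PRM definition above can be illustrated with a minimal sketch. It assumes that $S^{\mathrm{Mel}}_{r}$ and $X^{\mathrm{Mel}}_{r}$ are the clean and noisy Mel-band power values of the reference channel for a single time frame; the function names `mel_prm` and `mse`, the `eps` guard, and the toy numbers are hypothetical and not taken from the paper.

```python
import math

def mel_prm(clean_mel, noisy_mel, eps=1e-8):
    """Mel-PRM for one time frame: min(sqrt(S / X), 1) per Mel band.

    clean_mel, noisy_mel: lists of non-negative Mel-band values for the
    reference channel (assumed here to be power spectra).
    eps is an illustrative guard against division by zero, not from the paper.
    """
    return [min(math.sqrt(s / (x + eps)), 1.0)
            for s, x in zip(clean_mel, noisy_mel)]

def mse(predicted, target):
    """Mean-squared error between two equal-length mask vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Toy frame with three Mel bands (made-up numbers).
clean = [1.0, 4.0, 9.0]
noisy = [4.0, 4.0, 1.0]
target_mask = mel_prm(clean, noisy)
# Band 0: the ratio is below 1, so the mask is sqrt(1/4).
# Band 2: the ratio exceeds 1, so the mask is clipped to 1.

# Loss between a hypothetical network output and the target mask.
loss = mse([0.5, 0.9, 1.0], target_mask)
```

The clipping at 1 keeps the training target bounded in [0, 1] even in bands where the clean estimate exceeds the noisy mixture, which is a common reason for bounding ratio masks.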

Abstract

Online multichannel speech enhancement has been intensively studied recently. Although Mel-scale frequency is better matched to human auditory perception and more computationally efficient than linear frequency, few works operate in the Mel-frequency domain. To this end, this work proposes a Mel-scale framework, namely Mel-McNet. It processes spectral and spatial information with two key components: an effective STFT-to-Mel module that compresses multichannel STFT features into Mel-frequency representations, and a modified McNet backbone that operates directly in the Mel domain to generate enhanced LogMel spectra. The spectra can be fed directly to vocoders for waveform reconstruction or to ASR systems for transcription. Experiments on CHiME-3 show that Mel-McNet can reduce computational complexity by 60% while maintaining enhancement and ASR performance comparable to the original McNet. Mel-McNet also outperforms other SOTA methods, verifying the potential of Mel-scale speech enhancement.

Paper Structure

This paper contains 11 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The proposed Mel-scale framework. (a) The system overview. (b) The diagram of Mel-McNet. (c) The architecture of the STFT-to-Mel module. $F$, $F'$, $T$, and $M$ represent the number of linear-scale STFT frequencies, nonlinear Mel-scale frequencies, time frames, and microphones, respectively. $N_{1}$ and $N_{2}$ are the numbers of adjacent frequencies. $C$ is the number of context frames. $D$ is the dimension of hidden embeddings. For the McNet backbone, the feature dimension follows the form "batch dimension[dimension of one sample in a batch]", and the dash-dotted boxes indicate dimension transformation operations (see the McNet paper for details).