
ZipEnhancer: Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement

Haoxu Wang, Biao Tian

TL;DR

This work tackles the high computational cost of Dual-Path time-frequency speech enhancement, whose hidden features have four axes $(B,T,F,C)$, by introducing time- and frequency-domain Down-Up sampling in a Zipformer-based architecture. ZipEnhancer integrates DownSampleStacks within Dual-Path ZipformerBlocks to downsample and upsample symmetrically, reducing cost while maintaining performance. The model is trained with the ScaleAdam optimizer and the Eden learning rate scheduler, and is optimized with a multi-term loss that includes PESQ-based, STFT, magnitude, complex, phase, and time-domain components. It reaches state-of-the-art PESQ on DNS2020 ($3.69$) and VoiceBank+DEMAND ($3.63$) with about $2.04$M parameters and approximately $62.4$G FLOPs. The results point to a practical, efficient solution for monaural SE, with potential for causal real-time deployment in real-world systems.

Abstract

In contrast to other sequence tasks, which model hidden layer features with three axes, Dual-Path time and time-frequency domain speech enhancement models are effective and have few parameters but are computationally demanding because their hidden layer features have four axes. We propose ZipEnhancer, a Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement, which incorporates time- and frequency-domain Down-Up sampling to reduce computational cost. We introduce the ZipformerBlock as the core block and propose Dual-Path DownSampleStacks that scale down and scale up symmetrically. We also adopt the ScaleAdam optimizer and the Eden learning rate scheduler to further improve performance. Our model achieves new state-of-the-art results on the DNS 2020 Challenge and VoiceBank+DEMAND datasets, with perceptual evaluation of speech quality (PESQ) scores of 3.69 and 3.63, respectively, using 2.04M parameters and 62.41G FLOPs, outperforming other methods of similar complexity.
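
To make the cost argument concrete, the following is a minimal shape-level sketch, not the authors' implementation. It assumes that a dual-path block folds one of the time/frequency axes into the batch before running a sequence layer along the other, and it uses plain averaging and repetition as stand-ins for the paper's downsampling and upsampling modules. All function names and the `pairwise_cost` proxy (a count of attention-style pairwise interactions) are hypothetical.

```python
import math


def time_path_shape(b, t, f, c):
    """Fold frequency into the batch so each sequence runs along time."""
    return (b * f, t, c)


def freq_path_shape(b, t, f, c):
    """Fold time into the batch so each sequence runs along frequency."""
    return (b * t, f, c)


def downsample(seq, factor):
    """Average non-overlapping groups of `factor` items (assumed stand-in).

    The last group may be shorter than `factor`.
    """
    groups = [seq[i:i + factor] for i in range(0, len(seq), factor)]
    return [sum(g) / len(g) for g in groups]


def upsample(seq, factor, length):
    """Repeat each item `factor` times, then trim back to `length`."""
    repeated = [x for x in seq for _ in range(factor)]
    return repeated[:length]


def pairwise_cost(b, t, f, ds_t=1, ds_f=1):
    """Rough proxy: pairwise interactions of one time pass plus one
    frequency pass, both run at the downsampled resolution."""
    t_ds = math.ceil(t / ds_t)
    f_ds = math.ceil(f / ds_f)
    return b * f_ds * t_ds ** 2 + b * t_ds * f_ds ** 2


frames = [1.0, 3.0, 5.0, 7.0, 9.0]
low = downsample(frames, 2)
restored = upsample(low, 2, len(frames))

full = pairwise_cost(1, 100, 64)
reduced = pairwise_cost(1, 100, 64, ds_t=2, ds_f=2)
print(low, restored, full / reduced)
```

Under this proxy, halving both the time and frequency resolution cuts the pairwise-interaction count by a factor of eight, which illustrates why sampling along both axes is attractive for four-axis features. The real savings depend on the actual layers and sampling factors used in the model.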

Paper Structure (18 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The overall architecture of our ZipEnhancer.
  • Figure 2: (Left): The structure of T/F-ZipformerBlock. (Right): The structure of Non-Linear Attention module.
  • Figure 3: Spectrogram visualization of natural noisy and clean speech, and of speech enhanced by MP-SENet Up. and by our proposed ZipEnhancer(S).