ZipEnhancer: Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement

Haoxu Wang; Biao Tian

TL;DR

This work tackles the high computational cost of Dual-Path time-frequency speech enhancement, whose hidden features have four axes $(B,T,F,C)$, by introducing time- and frequency-domain Down-Up sampling in a Zipformer-based architecture. ZipEnhancer integrates DownSampleStacks within Dual-Path ZipformerBlocks to downsample and upsample symmetrically, reducing cost while maintaining performance. The model is trained with the ScaleAdam optimizer and the Eden learning rate scheduler, and optimized with a multi-term loss that includes PESQ-based, STFT, magnitude, complex, phase, and time-domain components. It reaches state-of-the-art PESQ on DNS2020 ($3.69$) and VoiceBank+DEMAND ($3.63$) with about $2.04$M parameters and approximately $62.4$G FLOPS. The results suggest a practical, efficient solution for monaural SE, with potential for causal real-time deployment in real-world systems.
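One way to read the multi-term loss named above is as a weighted sum of its six components. The form below is only a sketch: the weights $\lambda_i$ are placeholder symbols introduced here for illustration, and neither their values nor the exact definition of each term is given in this summary.

$$
\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\mathrm{PESQ}} + \lambda_{2}\,\mathcal{L}_{\mathrm{STFT}} + \lambda_{3}\,\mathcal{L}_{\mathrm{Mag}} + \lambda_{4}\,\mathcal{L}_{\mathrm{Com}} + \lambda_{5}\,\mathcal{L}_{\mathrm{Pha}} + \lambda_{6}\,\mathcal{L}_{\mathrm{Time}}
$$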

Abstract

In contrast to other sequence tasks, which model hidden-layer features with three axes, Dual-Path time and time-frequency domain speech enhancement models are effective and have few parameters, but they are computationally demanding because their hidden-layer features have four axes. We propose ZipEnhancer, a Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement, which incorporates time- and frequency-domain Down-Up sampling to reduce computational cost. We introduce the ZipformerBlock as the core block and propose Dual-Path DownSampleStacks that scale down and scale up symmetrically. We also adopt the ScaleAdam optimizer and the Eden learning rate scheduler to further improve performance. Our model achieves new state-of-the-art results on the DNS 2020 Challenge and Voicebank+DEMAND datasets, with perceptual evaluation of speech quality (PESQ) scores of 3.69 and 3.63, respectively, using 2.04M parameters and 62.41G FLOPS, outperforming other methods of similar complexity.
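The Down-Up sampling idea can be illustrated on a four-axis tensor $(B,T,F,C)$. The following is a minimal NumPy sketch, not the paper's modules: it assumes plain average pooling for downsampling and frame repetition for upsampling, and the function names, tensor sizes, and sampling factors are all hypothetical.

```python
import numpy as np

def downsample(x, factor, axis):
    """Average-pool x along a non-negative axis by factor (edge-padded)."""
    pad = (-x.shape[axis]) % factor
    if pad:
        pad_width = [(0, 0)] * x.ndim
        pad_width[axis] = (0, pad)
        x = np.pad(x, pad_width, mode="edge")
    # Split the axis into (steps, factor) and average over the factor part.
    new_shape = (x.shape[:axis]
                 + (x.shape[axis] // factor, factor)
                 + x.shape[axis + 1:])
    return x.reshape(new_shape).mean(axis=axis + 1)

def upsample(x, factor, axis, length):
    """Repeat each step factor times along axis, then trim to length."""
    x = np.repeat(x, factor, axis=axis)
    return np.take(x, np.arange(length), axis=axis)

def down_up(x, factor, axis, inner=lambda h: h):
    """Downsample, run inner at the reduced resolution, upsample back."""
    length = x.shape[axis]
    h = inner(downsample(x, factor, axis))
    return upsample(h, factor, axis, length)

# Hypothetical sizes: batch, time frames, frequency bins, channels.
B, T, F, C = 2, 101, 64, 8
x = np.random.default_rng(0).standard_normal((B, T, F, C))

y_time = down_up(x, factor=2, axis=1)  # time path: T is reduced, then restored
y_freq = down_up(x, factor=4, axis=2)  # frequency path: F is reduced, then restored
```

Here `inner` stands in for whatever block runs at the reduced resolution. Because self-attention cost grows quadratically with sequence length, running that block on a sequence shortened by a factor of 2 cuts the attention-score cost to roughly a quarter, which is the motivation for sampling along both the time and frequency axes.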

Paper Structure (18 sections, 3 figures, 4 tables)

Introduction
The proposed ZipEnhancer
The overall architecture of the model
DualPathZipformerBlocks
DownSampleStacks
ZipformerBlock
Encoder and Decoders
Training criteria
Optimizer and Learning Scheduler
Loss function
Experiments
Dataset
Experimental Setup
Evaluation metrics
Experimental Results
...and 3 more sections

Figures (3)

Figure 1: The overall architecture of our ZipEnhancer.
Figure 2: (Left): The structure of T/F-ZipformerBlock. (Right): The structure of Non-Linear Attention module.
Figure 3: Spectrogram visualization of natural noisy and clean speech, and of speech enhanced by MP-SENet Up. and our proposed ZipEnhancer(S).
