LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement

Haoyin Yan; Jie Zhang; Cunhang Fan; Yeping Zhou; Peiqi Liu

TL;DR

LiSenNet, a lightweight speech enhancement (SE) network for real-time applications, achieves competitive performance with only 37k parameters (half of the state-of-the-art model) and 56M multiply-accumulate (MAC) operations per second.

Abstract

Speech enhancement (SE) aims to extract the clean waveform from noise-contaminated measurements to improve speech quality and intelligibility. Although learning-based methods can perform much better than traditional counterparts, their high computational complexity and large model size severely limit deployment on latency-sensitive and low-resource edge devices. In this work, we propose a lightweight SE network (LiSenNet) for real-time applications. We design sub-band downsampling and upsampling blocks and a dual-path recurrent module to capture band-aware features and time-frequency patterns, respectively. A noise detector is developed to detect noisy regions so that SE can be performed adaptively, saving computational cost. Compared with recent, more resource-intensive baseline models, the proposed LiSenNet achieves competitive performance with only 37k parameters (half of the state-of-the-art model) and 56M multiply-accumulate (MAC) operations per second.
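The adaptive behaviour described in the abstract (enhance only where noise is detected) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `enhance_adaptively`, the per-frame boolean mask `is_noisy`, and the `enhance` callable are all hypothetical stand-ins, and a real recurrent model would also need to handle its internal state across skipped frames.

```python
import numpy as np

def enhance_adaptively(frames, is_noisy, enhance):
    """Apply `enhance` only to frames flagged as noisy (hypothetical helper).

    frames:   array of shape (num_frames, frame_len)
    is_noisy: boolean array of shape (num_frames,), e.g. a detector's output
    enhance:  callable mapping a batch of frames to enhanced frames
    Returns the output frames and the fraction of frames that were processed.
    """
    out = frames.copy()
    idx = np.flatnonzero(is_noisy)          # indices of frames to process
    if idx.size:
        out[idx] = enhance(frames[idx])     # clean frames pass through untouched
    return out, idx.size / max(len(frames), 1)

# Toy usage: the "enhancer" here is just an attenuation, for illustration only.
frames = np.ones((4, 8))
mask = np.array([True, False, True, False])
out, processed = enhance_adaptively(frames, mask, lambda x: 0.5 * x)
```

In this toy setting the cost of the enhancer scales with the fraction of frames flagged as noisy, which is the intuition behind using a detector to save computation.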

Paper Structure

This paper contains 12 sections, 7 equations, 2 figures, 3 tables.

Introduction
METHODOLOGY
Model Input
Encoder and Decoder
Dual-Path Recurrent (DPR) Module
Phase Refinement
Noise Detector (ND)
Loss Function
EXPERIMENTS
Experimental Setup
Experimental Results
CONCLUSION

Figures (2)

Figure 1: (a) The framework of the proposed LiSenNet, where $|\cdot|$ and $\angle\cdot$ denote the magnitude and phase extractors, $\Delta_f$ and $\Delta_t$ the differential operators along the frequency and time axes, and $(\cdot)^c$ and $(\cdot)^{1/c}$ the power compression and decompression operations at a ratio of $c$, respectively; (b) an example of sub-band DS-Conv and sub-band US-Conv; (c) the proposed DPR module; (d) the optional noise detector, where the Mel spectrogram is extracted as the input feature.
Figure 2: The real-time factor (RTF) under different noise proportions.
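The operators named in the Figure 1 caption can be sketched in NumPy as follows. This is a rough illustration under stated assumptions: the value `c=0.3`, the backward-difference form of $\Delta_f$ and $\Delta_t$, and the (frequency, time) array layout are not given in the caption, and the caption does not say which quantity each differential operator is applied to, so applying them to the compressed magnitude below is only for demonstration.

```python
import numpy as np

def power_compress(mag, c=0.3):
    """(.)^c : compress the dynamic range of a magnitude spectrogram (c is assumed)."""
    return np.power(mag, c)

def power_decompress(mag_c, c=0.3):
    """(.)^(1/c) : invert the power compression."""
    return np.power(mag_c, 1.0 / c)

def delta_f(x):
    """Backward difference along the frequency axis (axis 0); first row is zero."""
    return np.diff(x, axis=0, prepend=x[:1, :])

def delta_t(x):
    """Backward difference along the time axis (axis 1); first column is zero.
    A backward difference uses only past frames, so it stays causal."""
    return np.diff(x, axis=1, prepend=x[:, :1])

# Toy complex spectrogram with shape (frequency bins, time frames).
rng = np.random.default_rng(0)
spec = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))

mag = np.abs(spec)         # |.|  magnitude extractor
phase = np.angle(spec)     # angle(.) phase extractor
mag_c = power_compress(mag)
d_f, d_t = delta_f(mag_c), delta_t(mag_c)

# Decompressing and recombining with the phase recovers the original spectrum.
recon = power_decompress(mag_c) * np.exp(1j * phase)
```

Compression and decompression are exact inverses for non-negative magnitudes, so any enhancement applied in the compressed domain can be mapped back to the linear domain before recombining with the phase.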
