PrimeK-Net: Multi-scale Spectral Learning via Group Prime-Kernel Convolutional Neural Networks for Single Channel Speech Enhancement

Zizhen Lin, Junyu Wang, Ruili Li, Fei Shen, Xi Xuan

TL;DR

PrimeK-Net is a CNN-based model for single-channel speech enhancement that learns multi-scale spectral features by combining Group Prime-kernel Feedforward Channel Attention (GPFCA) with Deep Separable Dilated Dense Blocks (DSDDB). The GPFCA module takes the place of the Conformer and uses a Group Prime-kernel Feedforward Network (GPFN) with prime-sized kernels to capture short-, medium-, and long-range time-frequency context while maintaining linear complexity. A memory- and compute-efficient loss framework (L_old vs L_new) assesses the impact of a Consistency loss versus a Time loss, enabling fair comparisons with prior methods. On VoiceBank+DEMAND, PrimeK-Net achieves a PESQ of $3.61$ with only $1.41$M parameters, surpassing state-of-the-art baselines and indicating that CNN-based architectures can rival Transformer-based approaches for single-channel speech enhancement. Ablation studies confirm that the prime-kernel multi-scale design and the DSDDB both contribute to performance gains at reduced computational cost.
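To make the grouped prime-kernel idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the kernel sizes (3, 5, 7, 11), the averaging weights, the depthwise (per-channel) filtering, and the function names `conv1d_same` and `group_prime_kernel_conv` are all hypothetical. It shows only that splitting channels into groups and giving each group a different prime kernel size yields several receptive-field widths in one layer, at a cost that grows linearly with sequence length.

```python
from typing import List, Sequence


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 'same' 1-D correlation for an odd-length kernel.

    Cost is len(x) * len(kernel), i.e. linear in the sequence length.
    """
    k = len(kernel)
    assert k % 2 == 1, "kernel length must be odd to stay centred"
    half = k // 2
    n = len(x)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < n:  # positions outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def group_prime_kernel_conv(
    channels: Sequence[Sequence[float]],
    kernel_sizes: Sequence[int] = (3, 5, 7, 11),  # assumed values, not from the paper
) -> List[List[float]]:
    """Split channels into equal groups; filter each group with its own kernel size."""
    n_groups = len(kernel_sizes)
    assert len(channels) % n_groups == 0, "channels must divide evenly into groups"
    per_group = len(channels) // n_groups
    out = []
    for g, k in enumerate(kernel_sizes):
        kernel = [1.0 / k] * k  # averaging filter as a stand-in for learned weights
        for ch in channels[g * per_group:(g + 1) * per_group]:
            out.append(conv1d_same(ch, kernel))
    return out


# Usage: feed a unit impulse through every group and measure how far it spreads.
impulse = [0.0] * 21
impulse[10] = 1.0
outputs = group_prime_kernel_conv([impulse[:] for _ in range(4)])
spreads = [sum(1 for v in ch if v != 0.0) for ch in outputs]
```

In this sketch the impulse spreads over exactly as many time steps as the kernel size of its group, so the four groups cover short to longer contexts in a single pass.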

Abstract

Single-channel speech enhancement is a challenging ill-posed problem focused on estimating clean speech from degraded signals. Existing studies have demonstrated the competitive performance of combining convolutional neural networks (CNNs) with Transformers in speech enhancement tasks. However, existing frameworks have not sufficiently addressed computational efficiency and have overlooked the natural multi-scale distribution of the spectrum. Additionally, the potential of CNNs in speech enhancement has yet to be fully realized. To address these issues, this study proposes a Deep Separable Dilated Dense Block (DSDDB) and a Group Prime Kernel Feedforward Channel Attention (GPFCA) module. Specifically, the DSDDB brings higher parameter and computational efficiency to the Encoder/Decoder of existing frameworks. The GPFCA module takes the place of the Conformer, extracting deep temporal and frequency features of the spectrum with linear complexity. The GPFCA leverages the proposed Group Prime Kernel Feedforward Network (GPFN) to integrate long-range, medium-range, and short-range receptive fields at multiple granularities, while utilizing the properties of prime numbers to avoid periodic overlap effects. Experimental results demonstrate that the proposed PrimeK-Net achieves state-of-the-art (SOTA) performance on the VoiceBank+DEMAND dataset, reaching a PESQ score of 3.61 with only 1.41M parameters.
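The parameter-efficiency claim for the DSDDB can be illustrated with the standard weight counts for an ordinary convolution versus a depthwise separable one, together with the receptive field of a stack of dilated convolutions. The sketch below is generic arithmetic, not the paper's configuration: the channel count (64), the 3x3 kernel, and the dilation schedule (1, 2, 4, 8) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def conv2d_params(c_in: int, c_out: int, kh: int, kw: int, bias: bool = False) -> int:
    """Weights in a standard 2-D convolution."""
    return c_in * c_out * kh * kw + (c_out if bias else 0)


def depthwise_separable_params(
    c_in: int, c_out: int, kh: int, kw: int, bias: bool = False
) -> int:
    """Weights in a depthwise (kh x kw per channel) plus pointwise (1x1) pair."""
    depthwise = c_in * kh * kw + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def dilated_receptive_field(kernel_size: int, dilations) -> int:
    """Receptive field along one axis of stacked stride-1 dilated convolutions.

    Each layer widens the field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Assumed example values (not taken from the paper).
standard = conv2d_params(64, 64, 3, 3)                # 36864 weights
separable = depthwise_separable_params(64, 64, 3, 3)  # 4672 weights
reduction = standard / separable                      # roughly 7.9x fewer weights
field = dilated_receptive_field(3, [1, 2, 4, 8])      # 31 positions
```

Under these assumed sizes the separable layer needs roughly one eighth of the weights, while the dilation schedule still reaches a 31-position receptive field along the deepest path of the stack.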

Paper Structure

This paper contains 10 sections, 12 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 2: The internal details of the DDB (a) and DSDDB (b).
  • Figure 3: Comparison with other models.