Adaptive Convolution for CNN-based Speech Enhancement Models

Dahan Wang, Xiaobin Rong, Shiruo Sun, Yuxiang Hu, Changbao Zhu, Jing Lu

TL;DR

This work introduces adaptive convolution, a frame-wise causal dynamic convolution that generates time-varying kernels by aggregating $K$ candidate kernels with frame-level attention $A_k(t)$, enabling CNN-based speech enhancement to better track non-stationary spectral features. The authors develop a lightweight kernel-attention framework with options for multi-layer and joint attention, plus multi-frame parallelism, and demonstrate its effectiveness across diverse backbones. They further propose AdaptCRN, an ultra-lightweight encoder–decoder model incorporating adaptive convolution and spectral compression to achieve competitive results at around 40M MACs. Across extensive experiments, adaptive convolution yields significant gains for lightweight models, with strong qualitative and quantitative evidence that kernel allocation aligns with speech spectral characteristics; these findings are reinforced by ablations and visualizations. Overall, adaptive convolution offers a practical, scalable enhancement for real-time speech processing with broad applicability to CNN-based SE systems.

Abstract

Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have proven to be essential components of many high-performance models. In this paper, we introduce adaptive convolution, an efficient and versatile convolutional module that enhances the model's capability to adaptively represent speech signals. Adaptive convolution performs frame-wise causal dynamic convolution, generating time-varying kernels for each frame by assembling multiple parallel candidate kernels. A lightweight attention mechanism is proposed for adaptive convolution, leveraging both current and historical information to assign adaptive weights to each candidate kernel. This enables the convolution operation to adapt to frame-level speech spectral features, leading to more efficient extraction and reconstruction. We integrate adaptive convolution into various CNN-based models, highlighting its generalizability. Experimental results demonstrate that adaptive convolution significantly improves performance with a negligible increase in computational complexity, especially for lightweight models. Moreover, we present an intuitive analysis revealing a strong correlation between kernel selection and signal characteristics. Finally, we propose the adaptive convolutional recurrent network (AdaptCRN), an ultra-lightweight model that incorporates adaptive convolution and an efficient encoder-decoder design, achieving superior performance compared to models with similar or even higher computational costs.
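
The frame-wise kernel assembly described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the time-varying kernel is the attention-weighted sum $W(t) = \sum_{k=1}^{K} A_k(t) W_k$, and it substitutes a simple stand-in for the kernel attention (frequency pooling, a causal exponential moving average as the "historical information", a linear projection, and a softmax). The function names `kernel_attention` and `adaptive_conv`, the `proj` and `decay` parameters, and the tensor layout `(channels, frames, frequency bins)` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def kernel_attention(x, proj, decay=0.9):
    """Hypothetical stand-in for the paper's kernel attention.

    x:    (C_in, T, F) input features
    proj: (C_in, K) projection to K candidate-kernel logits (assumed)
    Returns A with shape (T, K); row t depends only on frames <= t.
    """
    pooled = x.mean(axis=2).T              # (T, C_in), pooled over frequency
    state = np.zeros(pooled.shape[1])      # causal running summary of the past
    logits = np.empty((pooled.shape[0], proj.shape[1]))
    for t in range(pooled.shape[0]):
        state = decay * state + (1.0 - decay) * pooled[t]
        logits[t] = state @ proj
    return softmax(logits, axis=-1)


def adaptive_conv(x, kernels, attn):
    """Frame-wise causal dynamic convolution.

    x:       (C_in, T, F)
    kernels: (K, C_out, C_in, kt, kf) candidate kernels, kf assumed odd
    attn:    (T, K) per-frame weights A_k(t)
    Returns y with shape (C_out, T, F).
    """
    _, c_out, _, kt, kf = kernels.shape
    _, n_frames, n_bins = x.shape
    # Pad only the past in time (causal); pad symmetrically in frequency.
    xp = np.pad(x, ((0, 0), (kt - 1, 0), (kf // 2, kf // 2)))
    y = np.zeros((c_out, n_frames, n_bins))
    for t in range(n_frames):
        # W(t) = sum_k A_k(t) * W_k, shape (C_out, C_in, kt, kf)
        w_t = np.tensordot(attn[t], kernels, axes=1)
        for f in range(n_bins):
            patch = xp[:, t:t + kt, f:f + kf]          # (C_in, kt, kf)
            y[:, t, f] = np.tensordot(w_t, patch, axes=3)
    return y
```

Because the candidate kernels are merged before the convolution is applied, each frame still costs one convolution plus a small weighted sum over the $K$ kernels, which is consistent with the abstract's claim of a negligible increase in computational complexity. The explicit loops are for readability only; a practical implementation would vectorize them.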

Paper Structure

This paper contains 29 sections, 14 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Schematic illustration of adaptive convolution: (a) overall architecture of the adaptive convolution layer, (b) architecture of kernel attention, (c) natural structure of a convolutional block with multiple sub-layers incorporating adaptive convolution, and (d) corresponding structure using joint multi-layer kernel attention.
  • Figure 2: The architecture of (a) AdaptCRN, (b) the basic block, and (c) the adaptive block.
  • Figure 3: Violin plots of (a) SI-SNR, (b) PESQ, and (c) DNSMOS-OVRL results for GTCRN and AdaptCRN on the DNS5 test set.
  • Figure 4: Example spectrograms from the DNS5 test set and kernel attention weight visualization: (a) noisy signal, (b) clean signal, (c) signal enhanced by DPCRN-light, (d) signal enhanced by DPCRN-light with adaptive convolution, (e) signal enhanced by AdaptCRN, (f) visualization of kernel attention weights across frames, derived from the third layer of the decoder in DPCRN-light with adaptive convolution.
  • Figure 5: The proportion of frames in which the $k$-th candidate kernel is selected as the dominant kernel across speech or non-speech frames, measured from the third decoder layer of DPCRN-light with adaptive convolution.
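
The statistic described in the Figure 5 caption can be sketched as follows, assuming the "dominant" kernel of a frame is the one with the largest attention weight and that a per-frame speech/non-speech label is already available. How the paper obtains those labels is not stated in this excerpt, so the `is_speech` mask and the function name `dominant_kernel_proportions` are hypothetical.

```python
import numpy as np


def dominant_kernel_proportions(attn, is_speech):
    """Share of frames in which each candidate kernel is dominant.

    attn:      (T, K) kernel attention weights per frame
    is_speech: (T,) boolean mask, True for speech frames (assumed given)
    Returns (speech_props, nonspeech_props), each of length K.
    """
    attn = np.asarray(attn)
    is_speech = np.asarray(is_speech, dtype=bool)
    n_kernels = attn.shape[1]
    dominant = attn.argmax(axis=1)         # index of the dominant kernel per frame

    def proportions(mask):
        counts = np.bincount(dominant[mask], minlength=n_kernels)
        total = counts.sum()
        # Return zeros when no frame falls in this class.
        return counts / total if total else np.zeros(n_kernels)

    return proportions(is_speech), proportions(~is_speech)
```

Comparing the two returned vectors shows whether particular kernels are preferentially selected for speech frames, which is the kind of correlation between kernel selection and signal characteristics that the abstract refers to.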