Adaptive Convolution for CNN-based Speech Enhancement Models

Dahan Wang; Xiaobin Rong; Shiruo Sun; Yuxiang Hu; Changbao Zhu; Jing Lu

TL;DR

This work introduces adaptive convolution, a frame-wise causal dynamic convolution that generates time-varying kernels by aggregating $K$ candidate kernels with frame-level attention weights $A_k(t)$, enabling CNN-based speech enhancement models to better track non-stationary spectral features. The authors develop a lightweight kernel-attention framework with options for multi-layer and joint attention, plus multi-frame parallelism, and demonstrate its effectiveness across diverse backbones. They further propose AdaptCRN, an ultra-lightweight encoder–decoder model that combines adaptive convolution with spectral compression to achieve competitive results at around 40M MACs. In the experiments, adaptive convolution yields its largest gains on lightweight models. Ablations and visualizations indicate that kernel allocation aligns with speech spectral characteristics. Overall, adaptive convolution is presented as a practical addition to real-time, CNN-based speech enhancement systems.
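
The aggregation step described above can be written compactly. This is a sketch under assumptions, not the paper's exact formulation: $W_k$ (the $k$-th candidate kernel) and $W(t)$ (the kernel applied at frame $t$) are symbols introduced here for illustration, and the normalization of the attention weights (for example by a softmax) is assumed, not stated in the summary.

$$
W(t) = \sum_{k=1}^{K} A_k(t)\, W_k, \qquad A_k(t) \ge 0, \qquad \sum_{k=1}^{K} A_k(t) = 1 .
$$

Because convolution is linear in the kernel, convolving the input with $W(t)$ gives the same result as the $A_k(t)$-weighted sum of the $K$ separate convolution outputs. Aggregating the kernels first therefore needs one convolution per frame instead of $K$.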

Abstract

Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have proven to be essential components of many high-performance models. In this paper, we introduce adaptive convolution, an efficient and versatile convolutional module that enhances the model's capability to adaptively represent speech signals. Adaptive convolution performs frame-wise causal dynamic convolution, generating time-varying kernels for each frame by assembling multiple parallel candidate kernels. A lightweight attention mechanism is proposed for adaptive convolution, leveraging both current and historical information to assign adaptive weights to each candidate kernel. This enables the convolution operation to adapt to frame-level speech spectral features, leading to more efficient extraction and reconstruction. We integrate adaptive convolution into various CNN-based models, highlighting its generalizability. Experimental results demonstrate that adaptive convolution significantly improves performance with a negligible increase in computational complexity, especially for lightweight models. Moreover, we present an intuitive analysis revealing a strong correlation between kernel selection and signal characteristics. Furthermore, we propose the adaptive convolutional recurrent network (AdaptCRN), an ultra-lightweight model that incorporates adaptive convolution and an efficient encoder-decoder design, achieving superior performance compared to models with similar or even higher computational costs.
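
To make the mechanism in the abstract concrete, the following NumPy sketch shows one possible reading of frame-wise causal dynamic convolution. It is illustrative only and is not the authors' implementation. The function name `adaptive_conv`, the linear attention weights `w_att`, the exponential moving average used to carry historical information, and the `decay` parameter are all assumptions made for this example. The abstract only says that the attention uses current and historical information. An odd frequency kernel size is also assumed so that the frequency padding is symmetric.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adaptive_conv(x, kernels, w_att, decay=0.9):
    """Frame-wise causal dynamic convolution (illustrative sketch).

    x:       (C_in, T, F) input features (channels, frames, frequency bins)
    kernels: (K, C_out, C_in, kt, kf) candidate kernels, kf assumed odd
    w_att:   (K, C_in) hypothetical linear attention weights
    decay:   hypothetical smoothing factor for the history summary
    Returns the output (C_out, T, F) and the attention weights (T, K).
    """
    K, C_out, C_in, kt, kf = kernels.shape
    _, T, F = x.shape
    # Causal padding: kt - 1 zero frames in the past, none in the future.
    # Frequency is padded symmetrically to keep F bins.
    pf = kf // 2
    xp = np.pad(x, ((0, 0), (kt - 1, 0), (pf, pf)))
    y = np.zeros((C_out, T, F))
    att = np.zeros((T, K))
    state = np.zeros(C_in)  # running summary of current and past frames
    for t in range(T):
        # Pool the current frame over frequency and mix it with the history.
        state = decay * state + (1.0 - decay) * x[:, t, :].mean(axis=-1)
        att[t] = softmax(w_att @ state)
        # Assemble one kernel for frame t from the K candidates.
        w_t = np.tensordot(att[t], kernels, axes=1)  # (C_out, C_in, kt, kf)
        window = xp[:, t:t + kt, :]  # original frames t - kt + 1 .. t
        for f in range(F):
            patch = window[:, :, f:f + kf]  # (C_in, kt, kf)
            y[:, t, f] = np.tensordot(w_t, patch, axes=3)
    return y, att


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 6))              # 2 channels, 8 frames, 6 bins
kernels = rng.standard_normal((3, 4, 2, 2, 3))  # K = 3 candidate kernels
w_att = rng.standard_normal((3, 2))
y, att = adaptive_conv(x, kernels, w_att)
print(y.shape, att.shape)
```

The explicit loops are there for readability. A practical version would batch the per-frame work, which is presumably where the multi-frame parallelism mentioned in the TL;DR comes in. In this sketch the output at frame `t` depends only on frames up to `t`, both through the padded window and through the running attention state.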
