Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training
Haixin Zhao, Kaixuan Yang, Nilesh Madhu
TL;DR
The paper tackles the computational cost of single-channel speech enhancement by introducing a Dynamically Slimmable Network (DSN), in which the dynamic parts of a shared backbone (grouped recurrent neural network units, multi-head attention, convolutional, and fully connected layers) are gated frame by frame. A policy module generates a frame-wise gating vector $\mathbf{g}$ via Gumbel-Softmax, and Metric-Guided Training (MGT) conditions the target activation ratio on input quality, measured by DNS-MOS OVRL. Empirically, the DSN matches a state-of-the-art lightweight baseline on instrumental enhancement metrics while using about 73% of its computational load (MACs) on average. MGT further improves performance, especially at low SNRs, and yields activation patterns that correlate with distortion severity. The approach extends dynamic slimming to multiple module types and offers a quality-aware training objective that could guide resource allocation in other models.
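To make the gating step concrete, here is a minimal sketch of frame-wise Gumbel-Softmax sampling in plain Python. It is not the authors' implementation: the two-way `[skip, use]` choice per frame, the function names `gumbel_softmax_gate` and `hard_gate`, the temperature, and the example logits are all assumptions made for illustration.

```python
import math
import random


def gumbel_softmax_gate(logits, tau=1.0, rng=random):
    """Sample a relaxed one-hot vector over gate choices for one frame.

    logits: unnormalised scores, one per choice (assumed here: [skip, use]).
    tau: temperature; smaller values push the sample towards one-hot.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep both logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_gate(soft):
    """Hard decision: 1 if the 'use' entry wins, else 0."""
    return 1 if soft[1] > soft[0] else 0


rng = random.Random(0)
# Hypothetical policy logits for five frames of one dynamic block.
frame_logits = [[2.0, -1.0], [0.0, 0.0], [-1.5, 1.0], [-3.0, 3.0], [1.0, 0.5]]
soft_gates = [gumbel_softmax_gate(l, tau=0.5, rng=rng) for l in frame_logits]
g = [hard_gate(s) for s in soft_gates]          # one gate value per frame
activation_ratio = sum(g) / len(g)              # fraction of frames that run the block
```

In a trainable model the soft sample would typically carry the gradient while the hard decision drives the forward pass; this sketch only shows the sampling and the resulting activation ratio.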
Abstract
To further reduce the complexity of lightweight speech enhancement models, we introduce a gating-based Dynamically Slimmable Network (DSN). The DSN comprises static and dynamic components. For architecture-independent applicability, we introduce distinct dynamic structures targeting the commonly used components, namely, grouped recurrent neural network units, multi-head attention, convolutional, and fully connected layers. A policy module adaptively governs the use of dynamic parts at a frame-wise resolution according to the input signal quality, controlling computational load. We further propose Metric-Guided Training (MGT) to explicitly guide the policy module in assessing input speech quality. Experimental results demonstrate that the DSN achieves comparable enhancement performance in instrumental metrics to the state-of-the-art lightweight baseline, while using only 73% of its computational load on average. Evaluations of dynamic component usage ratios indicate that the MGT-DSN can appropriately allocate network resources according to the severity of input signal distortion.
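The idea of tying resource use to input quality can be sketched as follows. The abstract does not give the actual mapping or loss, so everything here is an assumption: the linear rule in `target_ratio_from_quality`, its bounds `r_min` and `r_max`, the squared-error `ratio_loss`, and the MAC counts passed to `relative_load` are placeholders, not values from the paper.

```python
def target_ratio_from_quality(ovrl, lo=1.0, hi=5.0, r_min=0.2, r_max=1.0):
    """Map a quality score to a target activation ratio (hypothetical rule).

    Assumes a score range of [lo, hi] and that lower input quality
    should receive a larger share of the dynamic components.
    """
    clipped = min(max(ovrl, lo), hi)
    quality = (clipped - lo) / (hi - lo)  # 0 = worst, 1 = best
    return r_max - quality * (r_max - r_min)


def ratio_loss(gates, target):
    """Squared gap between the mean gate activity and the target ratio."""
    mean_activity = sum(gates) / len(gates)
    return (mean_activity - target) ** 2


def relative_load(static_macs, dynamic_macs, activation_ratio):
    """Cost spent, as a fraction of the cost with all dynamic parts on."""
    full = static_macs + dynamic_macs
    return (static_macs + activation_ratio * dynamic_macs) / full


# Placeholder numbers, for illustration only.
target = target_ratio_from_quality(2.0)      # fairly poor input quality
loss = ratio_loss([1, 1, 0, 1], target)      # gates from four frames
load = relative_load(static_macs=40.0, dynamic_macs=60.0, activation_ratio=0.5)
```

Note that `relative_load` is measured against the same network with every dynamic part active, which is not the same quantity as the 73% figure above; that figure is reported relative to the baseline model.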
