Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training

Haixin Zhao, Kaixuan Yang, Nilesh Madhu

TL;DR

The paper tackles the high computational cost of single-channel speech enhancement by introducing a Dynamically Slimmable Network (DSN) that gates dynamic blocks (GRUs, MHA, convolutional, and fully connected layers) frame by frame within a shared backbone. A policy module generates a frame-wise gating vector $\mathbf{g}$ via Gumbel-Softmax, and Metric-Guided Training (MGT) conditions the target activation ratio on input quality, measured by DNS-MOS OVRL, to guide resource allocation. Empirically, the DSN achieves enhancement metrics comparable to a state-of-the-art lightweight baseline while consuming about 73% of its MACs on average; MGT further improves performance, especially at low SNRs, and produces activation patterns that correlate with distortion severity. The approach extends dynamic slimming to multiple module types and offers a practical, quality-aware training objective that could guide resource allocation in other models.

Abstract

To further reduce the complexity of lightweight speech enhancement models, we introduce a gating-based Dynamically Slimmable Network (DSN). The DSN comprises static and dynamic components. For architecture-independent applicability, we introduce distinct dynamic structures targeting the commonly used components, namely, grouped recurrent neural network units, multi-head attention, convolutional, and fully connected layers. A policy module adaptively governs the use of dynamic parts at a frame-wise resolution according to the input signal quality, controlling computational load. We further propose Metric-Guided Training (MGT) to explicitly guide the policy module in assessing input speech quality. Experimental results demonstrate that the DSN achieves comparable enhancement performance in instrumental metrics to the state-of-the-art lightweight baseline, while using only 73% of its computational load on average. Evaluations of dynamic component usage ratios indicate that the MGT-DSN can appropriately allocate network resources according to the severity of input signal distortion.
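The frame-wise gating described above can be illustrated with a minimal sketch. It assumes a two-class (off/on) Gumbel-Softmax sample per frame and an additive merge of the static and dynamic paths. The function names, the lambda stand-ins for network layers, and the merge rule are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def gumbel_softmax_gate(logits, tau=1.0, rng=random):
    """Draw one gate decision from two-class logits ordered [off, on].

    Returns (hard_gate, soft_on): the discrete 0/1 decision and the relaxed
    probability of the "on" class.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) sample: -log(-log(U)), U ~ Uniform(0, 1), clamped away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    soft_on = exps[1] / sum(exps)
    hard_gate = 1 if noisy[1] > noisy[0] else 0
    return hard_gate, soft_on


def run_frame(x, static_fn, dynamic_fn, gate):
    """Always run the static path; run the dynamic path only when gate == 1."""
    y = static_fn(x)
    if gate == 1:
        # Additive merge is an illustrative assumption, not the paper's design.
        y = [a + b for a, b in zip(y, dynamic_fn(x))]
    return y


rng = random.Random(0)
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Equal logits stand in for a policy module's output (hypothetical values).
gates = [gumbel_softmax_gate([0.0, 0.0], tau=1.0, rng=rng)[0] for _ in frames]
outputs = [
    run_frame(f, lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v], g)
    for f, g in zip(frames, gates)
]
activation_ratio = sum(gates) / len(gates)
print(gates, activation_ratio)
```

Skipping `dynamic_fn` entirely when the gate is 0 is what saves computation at inference time. The soft probability is the quantity a training setup would typically backpropagate through in place of the hard decision.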

Paper Structure

This paper contains 8 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The architecture of the Dynamic Slimmable Network, in which the third convolutional layer pair and the GRU and MHA modules are dynamic components. The static network paths, denoted by solid arrows, are always processed, while the dynamic paths (dotted lines) can be slimmed during inference, guided by the estimated frame-wise gating vector $\mathbf{g}$. The same $\mathbf{g}$ is used for all dynamic modules. B denotes batch size, and T, F, and 32 are the tensor sizes along time, frequency, and channel, respectively. Boxed $G$ denotes gating operations.
  • Figure 2: The structure of the dynamic GRU block in frequency transformers. Circled $C$ denotes concatenation.
  • Figure 3: The structure of the GRU cell in the time transformer’s dynamic GRU groups. $\boldsymbol{h}$ denotes hidden states, and $\boldsymbol{x}_t$ is input.
  • Figure 4: The dynamic MHA structure. The parallel linear projections of query, key, and value feature maps share the same dynamic linear block structure but differ in their independent learnable parameters.
  • Figure 5: Evaluation results of the proposed dynamic models benchmarked against two static baselines. Performance gains are illustrated relative to the zero-activation baseline, that is, the dynamic model with all dynamic components deactivated, resulting in a computational complexity of 141 M MACs/s. The six-panel plot contrasts the static FFT-Net (equivalent to the 100% activated dynamic model), the standard dynamic model, and the MGT dynamic model across six instrumental metrics, at SNRs from 5 dB to 20 dB. Both dynamic models exhibit a 50% average activation ratio on the test dataset, corresponding to a computational complexity of 221 M MACs/s. The average metric score of the zero-activation baseline is indicated below the corresponding SNRs. The average MACs/s of models are provided as well for evaluation.
  • ...and 1 more figure
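The complexity figures in the Figure 5 caption can be cross-checked against the 73% claim with a short calculation. The sketch below assumes that cost grows linearly with the average activation ratio, which the source does not state. The fully activated cost it derives is therefore an extrapolation, not a reported number, and the function and constant names are hypothetical.

```python
ZERO_ACTIVATION_MACS = 141e6  # MACs/s with all dynamic components off (Figure 5)
HALF_ACTIVATION_MACS = 221e6  # MACs/s at a 50% average activation ratio (Figure 5)


def macs_per_second(ratio):
    """Estimated MACs/s at a given average activation ratio.

    Assumption: cost is the static cost plus the ratio times the cost of the
    fully activated dynamic components (linear interpolation).
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("activation ratio must lie in [0, 1]")
    dynamic_full = (HALF_ACTIVATION_MACS - ZERO_ACTIVATION_MACS) / 0.5
    return ZERO_ACTIVATION_MACS + ratio * dynamic_full


full = macs_per_second(1.0)  # extrapolated cost of the 100% activated model
relative_load = macs_per_second(0.5) / full
print(f"{full / 1e6:.0f} M MACs/s at full activation, relative load {relative_load:.3f}")
```

Under this linear assumption the fully activated model would cost about 301 M MACs/s, so 221 M MACs/s corresponds to roughly 73% of it, consistent with the average load reported in the abstract.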