From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks
Riccardo Miccini, Clément Laroche, Tobias Piechowiak, Xenofon Fafoutis, Luca Pezzarossa
TL;DR
This work addresses the challenge of providing auxiliary signal properties (e.g., VAD, noise category, $F_0$) for on-device speech enhancement without adding extra models. It leverages Dynamic Channel Pruning gating masks from a Conv-FSENet backbone and treats the masked activations as input features for lightweight linear/logistic predictors, achieving high accuracy with negligible computational overhead. The study demonstrates that a subset of about $C^\star \approx 202$ features suffices, with up to 93% VAD accuracy and $R^2 \approx 0.86$ for $F_0$, and provides visualizations (t-SNE, heatmaps) to illustrate the learned structure. Practically, this enables a single, efficient model to deliver SE while simultaneously estimating useful auxiliary attributes, benefiting robustness, privacy, and user experience on edge devices. Future work includes exploring non-linear predictors and joint training to further enhance performance and applicability.
Abstract
Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an $R^2$ of 0.86 on $F_0$ estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
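
The claim that predictions over binary masks reduce to weighted sums can be illustrated with a minimal sketch. It assumes a logistic predictor (for example, for VAD) whose input is a 0/1 gating mask; the function names, the six-channel mask, the weights, and the bias are all hypothetical toy values chosen for illustration, not taken from the paper.

```python
import math
from typing import Sequence


def masked_logit(mask: Sequence[int], weights: Sequence[float], bias: float) -> float:
    """Logit of a linear predictor whose input features are a binary mask."""
    # A 0/1 feature either contributes its weight or nothing, so the dot
    # product collapses to a sum over the weights of the active channels.
    return bias + sum(w for m, w in zip(mask, weights) if m)


def sigmoid(x: float) -> float:
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical toy values (assumptions, not from the paper).
mask = [1, 0, 1, 1, 0, 0]                   # 1 = channel kept, 0 = channel pruned
weights = [0.8, -1.2, 0.5, -0.3, 2.0, 0.1]  # one learned weight per channel
bias = -0.4

logit = masked_logit(mask, weights, bias)

# Reference: the ordinary dot product gives the same value, with multiplications.
dense = bias + sum(m * w for m, w in zip(mask, weights))

speech_probability = sigmoid(logit)
```

In this sketch the per-frame cost is one addition per active channel plus one sigmoid, which is one way to read the "negligible overhead" statement; an actual implementation may organize the computation differently.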
