From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks

Riccardo Miccini, Clément Laroche, Tobias Piechowiak, Xenofon Fafoutis, Luca Pezzarossa

TL;DR

This work addresses the challenge of providing auxiliary signal properties (e.g., VAD, noise category, $F_0$) for on-device speech enhancement without adding extra models. It leverages Dynamic Channel Pruning gating masks from a Conv-FSENet backbone and treats the masked activations as input features for lightweight linear/logistic predictors, achieving high accuracy with negligible computational overhead. The study demonstrates that a subset of about $C^\star \approx 202$ features suffices, with up to 93% VAD accuracy and $R^2 \approx 0.86$ for $F_0$, and provides visualizations (t-SNE, heatmaps) to illustrate the learned structure. Practically, this enables a single, efficient model to deliver SE while simultaneously estimating useful auxiliary attributes, benefiting robustness, privacy, and user experience on edge devices. Future work includes exploring non-linear predictors and joint training to further enhance performance and applicability.

Abstract

Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an $R^2$ of 0.86 on $F_0$ estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
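
The remark that predictions on binary masks reduce to weighted sums can be illustrated with a minimal sketch. It assumes a logistic predictor over a flat binary mask vector; the function names, mask length, weights, and bias below are hypothetical toy values for illustration and are not taken from the paper.

```python
import math
import random


def predict_from_mask(mask, weights, bias):
    """Logistic prediction from a binary mask (hypothetical helper).

    Because each mask entry is 0 or 1, the dot product mask . weights is just
    the sum of the weights at the active channels: no multiplications needed.
    """
    logit = bias + sum(w for m, w in zip(mask, weights) if m)
    return 1.0 / (1.0 + math.exp(-logit))


def predict_dense(mask, weights, bias):
    """Reference version using an explicit dot product, for comparison."""
    logit = bias + sum(m * w for m, w in zip(mask, weights))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy inputs (assumed values, chosen only to exercise the code).
random.seed(0)
mask = [random.randint(0, 1) for _ in range(16)]
weights = [random.gauss(0.0, 1.0) for _ in range(16)]
bias = 0.1

p_sparse = predict_from_mask(mask, weights, bias)
p_dense = predict_dense(mask, weights, bias)
print(math.isclose(p_sparse, p_dense))
```

Both functions compute the same probability; the first only accumulates the weights of channels that the gate left active, which is why the extra cost on top of the enhancement network can stay small.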

Paper Structure (9 sections, 4 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of the proposed system, showing data generation pipeline, model, targets extraction, and prediction models.
  • Figure 2: Example generated data, showing clean/noise/enhanced audio, binary pruning masks, and a selection of ground truths; for relevant targets, we highlight regions with voice activity.
  • Figure 3: Performance on each task using different input features (colors); first three subplots show classification, last three subplots show regression.
  • Figure 4: Low-dimensional visualization of pruning masks, computed using t-SNE; for each subplot, points are colored by different targets.
  • Figure 5: Normalized coefficients (red for positive, blue for negative) for models trained on the top-64 most informative features (x-axis, grouped by processing block); showing a subset of targets (y-axis).
  • ...and 1 more figures