Knowing When to Quit: Probabilistic Early Exits for Speech Separation

Kenny Falkær Olsen, Mads Østergaard, Karl Ulbæk, Søren Føns Nielsen, Rasmus Malik Høegh Lindrup, Bjørn Sand Jensen, Morten Mørup

TL;DR

This work designs a neural network architecture for speech separation and enhancement that supports early exit, and proposes an uncertainty-aware probabilistic framework that jointly models the clean speech signal and the error variance. The framework is used to derive probabilistic early-exit conditions expressed in terms of desired signal-to-noise ratios.

Abstract

In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use cases, we design a neural network architecture for speech separation and enhancement that supports early exit, and we propose an uncertainty-aware probabilistic framework that jointly models the clean speech signal and the error variance, which we use to derive probabilistic early-exit conditions in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks, where we demonstrate that early-exit capabilities can be introduced without compromising reconstruction. We also show that, when trained on variable-length audio, our early-exit conditions are well-calibrated and lead to considerable compute savings when used to dynamically scale compute at test time, while remaining directly interpretable.

Paper Structure

This paper contains 34 sections, 19 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Reconstructed spectrograms of two speakers from the WSJ0-2mix test set separated by a PRESS-4-S model with 4 exit points evaluated in segments of $T=2000$ samples, showing our proposed exit-SNR exit condition evaluated for each segment with a target level of $22$ dB (shown in red). The distributions of each exit-SNR condition are shown shaded by exit point, demonstrating non-trivial improvement for deeper exits. An extended version showing our other SNR-like distributions can be seen in \ref{sec:full_demo}, \ref{fig:early_exit_demo_full}.
  • Figure 2: Detailed architecture of PRESS-Net. It consists of three parts: an encoder, an early split module, and a reconstruction decoder that can reconstruct early.
  • Figure 3: Source separation performance on WSJ0-2mix in terms of SI-SNRi per compute (GMACs), with the area of each point corresponding to the model's parameter count. The static performance of every exit point is shown for the PRESS models, as well as the dynamic performance of the PRESS-4 (S) model using our probabilistic exit condition for varying target levels, which beats the static performance curve in efficiency. We also include the performance of single-exit models, which underperform the jointly trained model at deeper exits.
  • Figure 4: One-sided exit-SNR regret on the WSJ0-2mix test set for a PRESS-4 (S) model trained with a block size of 2000 samples using different early-exit strategies with target levels of $t=20,25,30$ dB: (dynamic) our probabilistic exit strategy in \ref{eq:exit_snr} evaluated for varying confidence thresholds $p$, (static) using a single exit for all blocks, (oracle) a best-case strategy that always exits when the target is achieved using the ground-truth exit-SNR, (uniform) an uninformed strategy that selects exit points uniformly at random.
  • Figure 5: Calibration curves for the predicted $\sigma_i^2$ mean error distributions on the WSJ0-2mix test set for a PRESS-4 (S) model with a block size of 2000 samples. In (a) and (b) we see that the distributions are uncalibrated when the model is trained on 4-second clips and evaluated on full-length sequences on both training and test data. In (c) and (d) we see that the model predictions become well-calibrated on both training and test data after finetuning on full-length training data.
  • ...and 3 more figures
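
The four exit strategies compared in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-exit probabilities `p_meets_target` (assumed to come from some model of the probability that an exit reaches the target SNR), and the `shortfall_db` helper are all hypothetical, and the paper's exact exit condition and regret definition are not reproduced here. The sketch assumes exits are ordered from shallowest to deepest and that a strategy falls back to the deepest exit when no exit qualifies.

```python
import random
from typing import Sequence


def dynamic_exit(p_meets_target: Sequence[float], confidence: float) -> int:
    """First exit whose predicted probability of reaching the target is at
    least `confidence`; falls back to the deepest exit (assumption)."""
    for i, p in enumerate(p_meets_target):
        if p >= confidence:
            return i
    return len(p_meets_target) - 1


def oracle_exit(true_snr_db: Sequence[float], target_db: float) -> int:
    """First exit whose ground-truth SNR reaches the target; falls back to
    the deepest exit (assumption)."""
    for i, snr in enumerate(true_snr_db):
        if snr >= target_db:
            return i
    return len(true_snr_db) - 1


def static_exit(num_exits: int, exit_index: int) -> int:
    """Always use the same exit, clamped to the valid range."""
    return max(0, min(exit_index, num_exits - 1))


def uniform_exit(num_exits: int, rng: random.Random) -> int:
    """Pick an exit uniformly at random, ignoring the input."""
    return rng.randrange(num_exits)


def shortfall_db(achieved_db: float, target_db: float) -> float:
    """Hypothetical one-sided penalty: how far below the target the chosen
    exit landed, and zero when the target is met."""
    return max(0.0, target_db - achieved_db)


if __name__ == "__main__":
    # Made-up values for one block with four exits (not results from the paper).
    true_snr_db = [15.0, 19.5, 23.0, 26.0]
    p_meets_target = [0.02, 0.35, 0.91, 0.99]
    target_db, confidence = 22.0, 0.9

    chosen = dynamic_exit(p_meets_target, confidence)
    print("dynamic exit:", chosen)
    print("oracle exit:", oracle_exit(true_snr_db, target_db))
    print("shortfall (dB):", shortfall_db(true_snr_db[chosen], target_db))
```

In this toy setup, raising `confidence` pushes the dynamic strategy toward deeper exits, spending more compute to lower the risk of missing the target, which is the trade-off the confidence threshold $p$ controls in the caption.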