Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement

Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Vamsi Krishna Ithapu, Shih-Chii Liu

TL;DR

The paper tackles the high computational cost of ultra-low-latency speech enhancement by proposing a SlowFast framework that splits processing into a slow, low-frame-rate analysis branch and a fast, time-domain enhancement branch. The fast branch is a state space model whose state transitions are dynamically modulated by the outputs of the slow branch through a lightweight SSMM mechanism, with a diagonal state-transition matrix to keep the parameter count small. Experiments on Voice Bank + DEMAND show a 70% reduction in compute at 2 ms latency with no loss in enhancement performance, and demonstrate a 62.5 μs latency (a single sample at 16 kHz) on edge hardware, using only 16 parameters in the fast branch. The framework also allows a potential remote-slow/edge-fast deployment model, which is of practical interest for low-latency speech processing on resource-constrained devices.

Abstract

Deep learning-based speech enhancement (SE) methods often face significant computational challenges when they must meet low-latency requirements, because of the increased number of frames to be processed. This paper introduces the SlowFast framework, which aims to reduce computation cost specifically when low-latency enhancement is needed. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the higher frame rate needed to match the required latency. Specifically, the fast branch employs a state space model whose state transition process is dynamically modulated by the slow branch. Experiments on an SE task with a 2 ms algorithmic latency requirement using the Voice Bank + DEMAND dataset show that our approach reduces computation cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 62.5 μs (one sample point at a 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and an SISNR of 16.62 dB.
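To make the two-rate idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a diagonal state transition and assumes that the slow branch emits one gain per state dimension that multiplicatively scales that transition; the abstract does not specify the exact form of the modulation. The names `slow_branch`, `fast_ssm_step`, and `slowfast_enhance` are hypothetical, and the slow branch is replaced by a toy energy-based function instead of a trained network.

```python
import math

SAMPLE_RATE = 16000
# One sample at 16 kHz lasts 1/16000 s = 62.5 microseconds.
single_sample_latency_us = 1e6 / SAMPLE_RATE
# A 2 ms latency budget corresponds to 32 samples at 16 kHz.
samples_in_2ms = SAMPLE_RATE * 2 // 1000


def slow_branch(frame):
    """Hypothetical stand-in for the slow analysis network.

    Maps one long frame to a modulation vector with entries in (0, 1).
    A real system would use a trained network here.
    """
    energy = sum(x * x for x in frame) / len(frame)
    g = 1.0 / (1.0 + math.exp(-energy))
    return [g, 0.5 * g]  # one gain per state dimension (state size 2)


def fast_ssm_step(h, x, a_diag, b, c, m):
    """One step of a diagonal SSM whose transition is scaled by m (assumed form).

    h_t = (a * m) * h_{t-1} + b * x_t,   y_t = sum(c * h_t)
    """
    h_new = [a * mi * hi + bi * x for a, mi, hi, bi in zip(a_diag, m, h, b)]
    y = sum(ci * hi for ci, hi in zip(c, h_new))
    return h_new, y


def slowfast_enhance(signal, slow_hop, a_diag, b, c):
    """Run the fast branch on every sample and refresh the modulation
    once every `slow_hop` samples, using only past samples (causal)."""
    h = [0.0] * len(a_diag)
    m = [1.0] * len(a_diag)  # no modulation until the first slow frame
    out, slow_calls = [], 0
    for t, x in enumerate(signal):
        if t > 0 and t % slow_hop == 0:
            m = slow_branch(signal[t - slow_hop:t])
            slow_calls += 1
        h, y = fast_ssm_step(h, x, a_diag, b, c, m)
        out.append(y)
    return out, slow_calls
```

In this sketch the expensive function runs once per `slow_hop` samples while the per-sample update is a handful of multiply-adds, which is the source of the compute saving the abstract describes.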

Paper Structure

This paper contains 8 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Illustration of the proposed SlowFast framework for compute-efficient low-latency speech enhancement. (A) SlowFast processing when $\delta=3$: the slow branch (orange, bottom) operates at a lower frame rate, and the fast branch (blue, top) operates at a higher rate. (B) Framing and OLA process: the slow branch processes longer segments with a larger hop size, while the fast branch processes shorter segments with a smaller hop size. The enhanced speech is obtained by applying overlap-add (OLA) to the fast branch outputs. (C) SSM modulation: the slow branch modulates the state transition process in the fast branch.
  • Figure 2: Two other methods investigated in this work for integrating the slow and fast branches. (A) Embedding concatenation: the output of the slow branch is used as an additional feature in the fast branch. (B) Feature-wise linear modulation: the slow branch generates two vectors that scale and shift the fast branch features.
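
The two alternative integration schemes in Figure 2 can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, features are plain Python lists, and the scale and shift vectors (`gamma`, `beta`) are passed in directly, whereas in the paper they are produced by the slow branch.

```python
def concat_integration(fast_feat, slow_emb):
    """(A) Embedding concatenation: append the slow-branch embedding to the
    fast-branch features, so the next layer sees a longer input vector."""
    return list(fast_feat) + list(slow_emb)


def film_integration(fast_feat, gamma, beta):
    """(B) Feature-wise linear modulation: scale each fast-branch feature by
    gamma and shift it by beta (both assumed to come from the slow branch)."""
    if not (len(fast_feat) == len(gamma) == len(beta)):
        raise ValueError("fast_feat, gamma, and beta must have equal length")
    return [g * f + b for f, g, b in zip(fast_feat, gamma, beta)]
```

Concatenation widens the input of the following fast-branch layer, while feature-wise modulation keeps the feature dimension unchanged and adds only one multiply and one add per feature.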