A Fast and Lightweight Model for Causal Audio-Visual Speech Separation

Wendi Sang, Kai Li, Runxuan Yang, Jianqiang Huang, Xiaolin Hu

TL;DR

This work tackles real-time audio-visual speech separation by designing a causal, lightweight model, Swift-Net, that fuses visual lip cues with power-aware audio features via a LightVid block, a Frequency-Time Grouped SRU (FTGS), and a selective-attention fusion (SAF) module. By employing grouped SRUs and parameter sharing across stacked FTGS blocks, Swift-Net achieves real-time performance with significantly reduced parameters and MACs, while maintaining state-of-the-art separation on LRS2-2Mix, LRS3-2Mix, and VoxCeleb2-2Mix under causal constraints. The authors also provide a causal-design toolkit to convert non-causal AVSS models into causal variants for fair comparison. Overall, Swift-Net demonstrates strong practical potential for streaming AVSS in noisy environments, and the work includes ablations validating the effectiveness of each architectural component and design choice.

Abstract

Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, so they operate offline and are unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, making more efficient use of that information. We further propose a causal transformation template to facilitate the conversion of non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrated that under causal conditions, our proposed Swift-Net exhibited outstanding performance, highlighting the potential of this method for processing speech in complex environments.

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Diagram of the segment-based causal adaptive average pooling layer. For clarity of presentation, we take as an example an input sequence of length $T=8$ that is evenly divided into $S=4$ segments. Here, avg denotes the averaging operation. At $t=1$, the computation begins when $x_1$ arrives; at $t=3$, the computation begins when $x_3$ arrives, and so on.
  • Figure 2: The overall pipeline of Swift-Net. The yellow line and the purple line represent the flow of visual features and audio features respectively. The $\odot$ symbol denotes element-wise multiplication in the complex domain.
  • Figure 3: The structural diagram of the LightVid block. Here, $c_v$ denotes the number of channels of the visual features, and $T_v$ represents the number of frames of the visual features.
  • Figure 4: The structural diagram of the FTGS block. Bi-GSRU denotes the Bidirectional Grouped SRU. Uni-GSRU denotes the Unidirectional Grouped SRU.
  • Figure 5: The structural diagram of SAF block. The yellow line and the purple line represent the flow of visual features and audio features respectively. $\phi$ represents nearest neighbor interpolation. $\rho$ represents the flattening of the last dimension of the tensor.
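
The segment-based causal pooling described in the Figure 1 caption can be illustrated with a small sketch. It is only one plausible reading of the caption, not the authors' implementation: it assumes that each segment starts a fresh average when its first sample arrives, and that the output at step $t$ is the mean of the samples seen so far in the current segment. The function name `causal_segment_avg_pool` and its parameters are hypothetical.

```python
def causal_segment_avg_pool(xs, num_segments):
    """Running mean within each segment, using only past and current samples.

    Assumption (not confirmed by the source): the sequence is split into
    `num_segments` equal segments, and the average restarts at the first
    sample of every segment.
    """
    total = len(xs)
    if num_segments <= 0 or total % num_segments != 0:
        raise ValueError("sequence length must be a positive multiple of num_segments")
    seg_len = total // num_segments

    outputs = []
    running_sum = 0.0
    for t, x in enumerate(xs):
        pos = t % seg_len          # position of x inside its segment (0-based)
        if pos == 0:
            running_sum = 0.0      # a new segment begins: forget the previous one
        running_sum += x
        outputs.append(running_sum / (pos + 1))
    return outputs


# Example matching the caption's setting: T = 8 samples, S = 4 segments.
pooled = causal_segment_avg_pool([1, 2, 3, 4, 5, 6, 7, 8], num_segments=4)
print(pooled)
```

Under this reading, the last value produced in each segment equals what an ordinary (non-causal) adaptive average pool would output for that segment, while earlier values never depend on samples that have not yet arrived.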