A Fast and Lightweight Model for Causal Audio-Visual Speech Separation
Wendi Sang, Kai Li, Runxuan Yang, Jianqiang Huang, Xiaolin Hu
TL;DR
This work tackles real-time audio-visual speech separation by designing a causal, lightweight model, Swift-Net, that fuses visual lip cues with power-aware audio features via a LightVid block, a Frequency-Time Grouped SRU (FTGS), and a selective-attention fusion (SAF) module. By employing grouped SRUs and parameter sharing across stacked FTGS blocks, Swift-Net achieves real-time performance with significantly reduced parameters and MACs, while maintaining state-of-the-art separation on LRS2-2Mix, LRS3-2Mix, and VoxCeleb2-2Mix under causal constraints. The authors also provide a causal-design toolkit to convert non-causal AVSS models into causal variants for fair comparison. Overall, Swift-Net demonstrates strong practical potential for streaming AVSS in noisy environments, and the work includes ablations validating the effectiveness of each architectural component and design choice.
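To make the grouped-SRU idea concrete, here is a minimal sketch in plain Python. It is not the authors' FTGS block: the class names `SRUCell` and `GroupedSRU`, the random weight initialisation, and the use of the basic SRU recurrence with a tanh activation are all assumptions made for illustration. It shows only how splitting channels into groups gives each group its own small recurrent unit and its own carried state, so that each frame can be processed as it arrives.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class SRUCell:
    """One step of a basic Simple Recurrent Unit (illustrative variant)."""

    def __init__(self, dim, rng):
        def mat():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

        self.W, self.Wf, self.Wr = mat(), mat(), mat()
        self.bf = [0.0] * dim
        self.br = [0.0] * dim

    def step(self, x, c_prev):
        x_tilde = matvec(self.W, x)
        f = [sigmoid(a + b) for a, b in zip(matvec(self.Wf, x), self.bf)]  # forget gate
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), self.br)]  # reset gate
        # The state update depends only on the current input and the previous state.
        c = [fi * cp + (1 - fi) * xt for fi, cp, xt in zip(f, c_prev, x_tilde)]
        # Highway-style output mixes the activated state with the raw input.
        h = [ri * math.tanh(ci) + (1 - ri) * xi for ri, ci, xi in zip(r, c, x)]
        return h, c


class GroupedSRU:
    """Hypothetical grouped wrapper: one small SRU per channel group."""

    def __init__(self, channels, groups, seed=0):
        assert channels % groups == 0, "channels must divide evenly into groups"
        rng = random.Random(seed)
        self.size = channels // groups
        self.cells = [SRUCell(self.size, rng) for _ in range(groups)]
        self.state = [[0.0] * self.size for _ in range(groups)]

    def step(self, frame):
        # Process one incoming frame; only past state is used, so this is causal.
        out = []
        for g, cell in enumerate(self.cells):
            chunk = frame[g * self.size:(g + 1) * self.size]
            h, self.state[g] = cell.step(chunk, self.state[g])
            out.extend(h)
        return out
```

In this sketch, splitting C channels into G groups cuts the recurrent weight count from 3C² to 3C²/G, which is one way grouping can lower parameters and MACs. Sharing weights across stacked blocks would reduce the count further, but that is not shown here.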
Abstract
Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods use complex architectures and rely on future context, so they can only operate offline and are unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, making more efficient use of that information. We further propose a causal transformation template to facilitate the conversion of non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) show that, under causal conditions, Swift-Net achieves outstanding performance, highlighting the potential of this method for processing speech in complex environments.
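The abstract does not spell out the causal transformation template, so the following sketch illustrates only one generic and widely used ingredient of such conversions: replacing symmetric padding in a 1-D convolution with left-only padding, so that each output frame depends on current and past inputs alone. The function name `conv1d` and its `causal` flag are hypothetical and are not taken from the paper.

```python
def conv1d(x, kernel, causal):
    """Same-length 1-D convolution (cross-correlation form) over a list of floats.

    causal=False pads symmetrically, so output t also sees future samples.
    causal=True pads only on the left, so output t sees samples up to t only.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


kernel = [0.25, 0.5, 0.25]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
x_future_changed = [1.0, 2.0, 3.0, 4.0, 50.0]  # only the last sample differs

# Causal outputs before the changed sample are unaffected by it.
causal_same = conv1d(x, kernel, True)[:4] == conv1d(x_future_changed, kernel, True)[:4]

# The non-causal output at index 3 looks one sample ahead, so it changes.
noncausal_same = conv1d(x, kernel, False)[3] == conv1d(x_future_changed, kernel, False)[3]
```

Here `causal_same` evaluates to `True` and `noncausal_same` to `False`, which is the property a streaming model needs: past outputs never have to be revised when new audio arrives.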
