From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation

Ke Xue, Rongfei Fan, Lixin, Dawei Zhao, Chao Zhu, Han Hu

TL;DR

CSFNet tackles the cocktail-party problem in audio-visual speech separation with a recursive coarse-to-fine framework. The method first performs coarse separation using fused audio-visual cues, then feeds the coarse output to a pretrained AVSR model to extract richer, speaker-aware semantics for refinement. The approach adds a speaker-aware fusion block and a multi-range spectro-temporal backbone, yielding state-of-the-art results on clean and noisy benchmarks and showing robustness to visual occlusion. Collectively, the work suggests that explicitly refining semantic representations across modalities is key to stronger multi-speaker separation and practical robustness.

Abstract

Audio-visual speech separation aims to isolate each speaker's clean voice from mixtures by leveraging visual cues such as lip movements and facial features. While visual information provides complementary semantic guidance, existing methods often underexploit its potential by relying on static visual representations. In this paper, we propose CSFNet, a Coarse-to-Separate-Fine Network that introduces a recursive semantic enhancement paradigm for more effective separation. CSFNet operates in two stages: (1) Coarse Separation, where a first-pass estimation reconstructs a coarse audio waveform from the mixture and visual input; and (2) Fine Separation, where the coarse audio is fed back into an audio-visual speech recognition (AVSR) model together with the visual stream. This recursive process produces more discriminative semantic representations, which are then used to extract refined audio. To further exploit these semantics, we design a speaker-aware perceptual fusion block to encode speaker identity across modalities, and a multi-range spectro-temporal separation network to capture both local and global time-frequency patterns. Extensive experiments on three benchmark datasets and two noisy datasets show that CSFNet achieves state-of-the-art (SOTA) performance, with substantial coarse-to-fine improvements, validating the necessity and effectiveness of our recursive semantic enhancement framework.
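
The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `coarse_to_fine` and the arguments `coarse_net`, `avsr`, and `fine_net` are hypothetical names, and the toy stand-ins exist only so the sketch runs. The real components are neural networks whose interfaces are not given here.

```python
import numpy as np


def coarse_to_fine(mixture, video, coarse_net, avsr, fine_net):
    """Two-pass separation for one target speaker (illustrative only).

    All three callables are hypothetical placeholders for the
    coarse separator, the AVSR model, and the fine separator.
    """
    # Stage 1: first-pass estimate from the mixture and the visual stream.
    coarse_audio = coarse_net(mixture, video)
    # Recursive step: feed the coarse estimate back into the AVSR model
    # together with the visual stream to obtain semantic features.
    semantics = avsr(coarse_audio, video)
    # Stage 2: extract the refined audio using those semantic features.
    fine_audio = fine_net(mixture, semantics)
    return coarse_audio, fine_audio


# Toy stand-ins so the sketch executes; they are NOT the paper's networks.
def toy_coarse_net(mixture, video):
    return 0.5 * mixture


def toy_avsr(audio, video):
    return np.array([audio.mean(), video.mean()])


def toy_fine_net(mixture, semantics):
    return mixture - semantics[0]


mixture = np.array([1.0, 2.0, 3.0, 4.0])  # stand-in for a mixed waveform
video = np.zeros((2, 8, 8))               # stand-in for lip-region frames
coarse, fine = coarse_to_fine(mixture, video, toy_coarse_net, toy_avsr, toy_fine_net)
```

The point of the sketch is the ordering: the AVSR model sees the coarse estimate rather than only the visual stream, so the semantic features passed to the second stage depend on the first stage's output.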

Paper Structure

This paper contains 33 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Previous vs. our audio-visual separation methods
  • Figure 2: The overall pipeline of CSFNet comprises two stages: coarse separation and fine separation. The coarse audio from the coarse separation stage is used for fine separation, where the fine separation network is fine-tuned based on the coarse stage to produce refined audio.
  • Figure 3: The two main components of CSFNet: (a) SP fusion block, (b) MST separation module.
  • Figure 4: SI-SDRi under different numbers of missing visual cue frames for (a) one speaker, (b) two speakers on the LRS2-2Mix dataset.
  • Figure 5: Comparison of fusion strategies (SP vs. CC) and audio-visual separation methods under different mixing conditions on VoxCeleb2. SP denotes the speaker-wise perceptual fusion block, CC denotes simple concatenation, and “Miss 1” indicates that one speaker’s visual stream is missing. When a larger portion of the visual input is absent, the advantage of any fusion strategy diminishes and the comparison becomes less meaningful.
  • ...and 3 more figures
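
Figure 4 reports SI-SDRi, which this page does not define. The sketch below uses the standard definition of scale-invariant signal-to-distortion ratio (SI-SDR) and takes the improvement as the gain of the separated estimate over the unprocessed mixture. The function names and the `eps` stabilizer are illustrative choices, and the paper's exact evaluation code may differ.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D waveforms."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimally scaled target.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property that makes the metric suitable for comparing separation systems with different output gains.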