Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Zhaoxi Mu, Xinyu Yang

TL;DR

This work tackles modality imbalance in audio-visual target speech extraction (AV-TSE) by introducing AVSepChain, a two-stage framework that mimics the speech chain through a perception stage (audio-dominant with visual conditioning) and a production stage (visual-dominant with audio conditioning). It uses AV-HuBERT for lip representation, an AV-Sepformer-based AV-Separator for separation, and an AV-Synthesizer that predicts a residual signal to refine the separated speech, while enforcing semantic alignment via a contrastive loss between pseudo-phonemes and pseudo-visemes. The approach achieves state-of-the-art results on LRS2-2Mix and VoxCeleb2-2Mix, generalizes across domains to LRS3 and TCD-TIMIT, and is supported by ablations of each component (production stage, semantic matching, residual signaling, visual front-end choices, and modulation strategies). Overall, AVSepChain improves perceptual speech quality and downstream ASR performance, offering a practical way to mitigate modality imbalance in audio-visual speech processing.

Abstract

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.
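To make the idea of a contrastive semantic matching loss concrete, the following is a minimal sketch of a generic InfoNCE-style objective between frame-level embeddings of the generated speech (pseudo-phonemes) and of the lip movements (pseudo-visemes). It is not the paper's exact formulation, which is not given here. The function name `contrastive_matching_loss`, the cosine similarity, the `temperature` value, and the choice of time-aligned frames as positives with all other frames as negatives are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_matching_loss(audio_emb, visual_emb, temperature=0.1):
    """InfoNCE-style loss over T frames (hypothetical formulation).

    audio_emb[t] and visual_emb[t] are treated as a positive pair;
    visual_emb[s] for s != t act as negatives for audio frame t.
    """
    if len(audio_emb) != len(visual_emb):
        raise ValueError("both sequences must have the same number of frames")
    total = 0.0
    for t, a in enumerate(audio_emb):
        logits = [cosine(a, v) / temperature for v in visual_emb]
        # Numerically stable log-sum-exp over all candidate frames.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the time-aligned frame.
        total += log_denom - logits[t]
    return total / len(audio_emb)


# Toy embeddings: two frames, two dimensions (illustrative values only).
audio = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = contrastive_matching_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = contrastive_matching_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is small when each audio frame is most similar to its time-aligned visual frame and large when the frames are swapped, which is the behavior a semantic matching objective of this kind relies on.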

Paper Structure

This paper contains 28 sections, 6 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: The schematic diagram of information flow within the speech chain. The speech chain captures the movement of information between the speaker and the perceiver during speech communication, encompassing the processes of speech production and speech perception.
  • Figure 2: The overall framework of AVSepChain encompasses two stages: speech perception and speech production. In the speech perception stage, the AV-Separator initially extracts the target speaker's speech. In the speech production stage, the AV-Synthesizer predicts the residual signal of the output from the speech perception stage. In the speech perception stage, audio is treated as the dominant modality, while visual information serves as the conditional modality. This relationship is reversed in the speech production stage. AV-HuBERT and HuBERT, depicted in the solid line box, have their parameters fixed during training. The AV-Separator and AV-Synthesizer, shown in the dotted box, have their parameters updated during training. The embeddings extracted by AV-HuBERT and HuBERT are utilized to calculate the contrastive modality matching loss.
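
The data flow described in Figure 2 can be sketched as below. This is a toy illustration only: `av_sep_chain`, `toy_separator`, and `toy_synthesizer` are hypothetical names, the stand-in functions are not the real AV-Separator or AV-Synthesizer, and adding the predicted residual to the perception-stage output is an assumed reading of "predicts the residual signal". Signals are plain lists of floats, and the lip features are assumed to be already resampled to the audio length.

```python
from typing import Callable, List

Signal = List[float]


def av_sep_chain(
    mixture: Signal,
    lip_features: Signal,
    separator: Callable[[Signal, Signal], Signal],
    synthesizer: Callable[[Signal, Signal], Signal],
) -> Signal:
    """Two-stage flow: perception first, then production (assumed combination)."""
    # Perception stage: audio is the dominant input, lips are the condition.
    coarse = separator(mixture, lip_features)
    # Production stage: lips are the dominant input, the coarse speech
    # estimate is the condition; the output is a residual correction.
    residual = synthesizer(lip_features, coarse)
    if len(residual) != len(coarse):
        raise ValueError("residual and coarse estimate must have equal length")
    # Assumption: the refined estimate is coarse output plus residual.
    return [c + r for c, r in zip(coarse, residual)]


def toy_separator(mixture: Signal, lip_features: Signal) -> Signal:
    # Stand-in: halve the mixture; a real separator would use the lip condition.
    return [0.5 * x for x in mixture]


def toy_synthesizer(lip_features: Signal, coarse: Signal) -> Signal:
    # Stand-in: a small correction driven by the visual stream.
    return [0.1 * v for v in lip_features]


refined = av_sep_chain(
    [1.0, -2.0, 4.0], [1.0, 0.0, -1.0], toy_separator, toy_synthesizer
)
```

Passing the two stages in as callables mirrors the caption's split between frozen feature extractors and trainable modules: the stage order and the swap of dominant and conditional modality stay fixed, while the modules themselves can be replaced.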