$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li

TL;DR

This work tackles AV-TSE by introducing context- and confidence-aware mechanisms. The Mask-And-Recover (MAR) strategy incorporates intra-speech context and target lip movements to provide global extraction cues, while the Fine-Grained Confidence Score (FCS) model identifies unreliable segments for targeted refinement. A two-stage fine-tuning pipeline (global fine-tuning followed by confidence-aware fine-tuning, in self-supervised and supervised variants) is model-agnostic and improves six representative AV-TSE backbones on VoxCeleb2. The approach yields consistent gains across multiple metrics and is robust to visual impairments, underscoring the practical value of context and confidence cues in real-world audio-visual speech processing.

Abstract

Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by relying primarily on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results not only in suboptimal performance but also in inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-Grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
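
The masking idea behind MAR can be illustrated with a small sketch. It is not the authors' implementation: the function names, the span length, the mask ratio, zero-filling of masked positions, and the mean-squared-error recovery loss are all assumptions chosen for illustration, and plain Python lists stand in for real audio or embedding tensors.

```python
import random


def random_span_mask(num_frames, span_len, mask_ratio, seed=0):
    """Return a list of booleans; True marks a masked frame.

    Hypothetical helper: spans may overlap, so the masked fraction
    is at most `mask_ratio`.
    """
    if span_len < 1:
        raise ValueError("span_len must be at least 1")
    rng = random.Random(seed)
    mask = [False] * num_frames
    num_spans = int(num_frames * mask_ratio) // span_len
    max_start = num_frames - span_len
    if num_spans == 0 or max_start < 0:
        return mask
    for _ in range(num_spans):
        start = rng.randint(0, max_start)  # inclusive on both ends
        for t in range(start, start + span_len):
            mask[t] = True
    return mask


def apply_mask(x, mask):
    """Zero out masked positions of the mixture (assumed masking scheme)."""
    return [0.0 if m else v for v, m in zip(x, mask)]


def recovery_loss(recovered, target, mask):
    """Mean squared error over masked positions only (assumed loss form)."""
    errs = [(r - t) ** 2 for r, t, m in zip(recovered, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


mixture = [1.0] * 1000
mask = random_span_mask(len(mixture), span_len=25, mask_ratio=0.2)
masked_mixture = apply_mask(mixture, mask)
```

Restricting the loss to masked positions means the model is only rewarded for what it infers from the surrounding speech and the visual stream, which is the contextual behavior the abstract describes.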

Paper Structure

This paper contains 44 sections, 5 equations, 10 figures, 3 tables, 2 algorithms.

Figures (10)

  • Figure 1: The unreliable segments are from AV-TSE results. $v$ and $x$ denote the target visual cue and mixture speech signal, respectively. One unreliable extraction segment is indicated with dotted rectangles, where the interfering speech signal is in light green, and the target speech signal is in red.
  • Figure 2: Contextual cues in AV-TSE. The current frame is denoted by a blue rectangle. In addition to the corresponding visual cue in the blue rectangle, the target visual context and target speech context also serve as additional cues for the extraction. In contrast, the green signal denotes the sum of interfering speech, which may harm the extraction performance.
  • Figure 3: Illustration of the MAR strategy. The input to the speaker extractor includes intact visual cues $v$ and the masked mixture speech signal $x$. The output of the speaker extractor includes the extracted speech embedding $X$ (shown in orange) and the corresponding visual cue embedding $V$ (shown in blue). Both the intra-modality context from the target speech context $X_{ctx}$ and the inter-modality context from $V$ contribute to recovering the masked region $X_{mask}$. Here, the temporally synchronized visual cue of $X_{mask}$ serves as a direct visual cue, and the remaining visual frames serve as visual contextual cues. To distinguish different levels of contribution, the relevance of the context is represented by curves of varying thickness. By modeling both types of contextual information during extraction, the learned contextual correlation is injected into the speaker extractor as additional extraction cues.
  • Figure 4: Fine-Grained Confidence Score (FCS) Prediction model
  • Figure 5: Illustration of the proposed two-stage fine-tuning strategy. Stage 1: (a) Global fine-tuning, using the vanilla MAR strategy with randomly masked mixture speech $x$ and intact visual cue $v$. All modules except the MAR block are initialized from the pre-trained AV-TSE model, and all modules except the visual encoder are fine-tuned. $I, \overline{I}$ denote the mask automatically detected from $x$. Stage 2: (b, c) Confidence-aware fine-tuning. All modules are initialized from Stage 1. A frozen pre-trained FCS model is integrated to detect unreliable extraction segments. (b) Self-supervised fine-tuning, similar to Stage 1, except that segments of $x$ are masked based on predicted confidence scores. The first forward pass, shown as a dotted line, stops the gradient and is used to infer the masked region. The second forward pass, shown in blue, performs self-supervised fine-tuning with the masked mixture input. (c) Supervised fine-tuning, no masked mixture input, and MAR block for recovery loss. Two variants are considered. Left: full fine-tuning of all modules. Right: fine-tuning the adapter only, with all other modules frozen.
  • ...and 5 more figures
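
The Figure 5(b) caption describes masking segments of $x$ according to predicted confidence scores. A minimal sketch of that selection step follows. It assumes the FCS model yields one score per frame, with lower values meaning less reliable extraction. The threshold, the minimum segment length, and both function names are hypothetical and do not come from the paper.

```python
def low_confidence_segments(scores, threshold, min_len=1):
    """Group consecutive frames scoring below `threshold` into
    half-open (start, end) segments, dropping runs shorter than `min_len`."""
    segments = []
    start = None
    for t, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = t  # a low-confidence run begins here
        elif start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


def segments_to_mask(segments, num_frames):
    """Convert (start, end) segments into a per-frame boolean mask."""
    mask = [False] * num_frames
    for start, end in segments:
        for t in range(start, end):
            mask[t] = True
    return mask


# Hypothetical frame-level confidence scores for one utterance.
scores = [0.9, 0.2, 0.3, 0.8, 0.1, 0.95, 0.4, 0.4]
segments = low_confidence_segments(scores, threshold=0.5, min_len=2)
mask = segments_to_mask(segments, len(scores))
```

In this sketch the resulting mask would replace the random mask of Stage 1, so the second forward pass has to recover exactly the regions that the first pass extracted poorly.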