CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction

Jiadong Wang, Ke Zhang, Xinyuan Qian, Ruijie Tao, Haizhou Li, Björn Schuller

TL;DR

This paper proposes an audio-visual learner that disentangles speaker information, acoustic synchronisation, and semantic synchronisation as distinct cues, and designs a dedicated interaction module that effectively integrates these cues to provide a reliable guidance signal for speaker extraction.

Abstract

Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance, the issue of degraded visual inputs has received relatively little attention, despite being common in real-world scenarios. Previous attempts to address this problem have mainly involved training with degraded visual data. However, visual degradation can occur in many unpredictable ways, making it impractical to simulate all possible cases during training. In this paper, we aim to enhance the robustness of audio-visual speaker extraction against impaired visual inputs without relying on degraded videos during training. Inspired by observations from human perceptual mechanisms, we propose an audio-visual learner that disentangles speaker information, acoustic synchronisation, and semantic synchronisation as distinct cues. Furthermore, we design a dedicated interaction module that effectively integrates these cues to provide a reliable guidance signal for speaker extraction. Extensive experiments demonstrate the strong robustness of the proposed model under various visual degradations and its clear superiority over existing methods.

Paper Structure (16 sections, 8 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Pipeline of CueNet. The audio-visual learner performs cross-modal interaction between the audio feature $F_a$ and the lip-movement feature $F_v$, and hierarchically disentangles three cues: speaker, acoustic, and semantic cues. The cue interaction module then leverages these three cues to extract features belonging to the target speaker. Finally, the backend module reconstructs the target speech signal.
  • Figure 2: Low-level or high-level learners. If an entry contains a slash (/), the features before and after the slash correspond to those in the low-level and high-level learners, respectively.
  • Figure 3: The Cue Interaction Module. The three cues, $C_s$, $C_a$, and $C_w$, are individually applied to extract features belonging to the target speaker. The cue-enhanced features are then dynamically fused. $R_a$ denotes the speech feature, computed by summing $F_a$ and $F_{a\_h}$ in Fig. \ref{fig:arch}.
  • Figure 4: Attention across the temporal dimension. There are 50 temporal frames.
  • Figure 5: Performance across different degrees of corruption. Masked features are applied to simulate visual degradation. "A", "S", and "W" indicate the acoustic, speaker, and semantic cues, respectively.
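
The Figure 5 caption says that masked features simulate visual degradation at different degrees of corruption, but it does not say how the mask is built. The following is a minimal sketch of one plausible reading, not the authors' procedure: it assumes the corruption degree is the fraction of temporal frames whose visual feature vector is replaced by zeros, with the masked frames chosen uniformly at random. The function name `mask_frames`, the `ratio` and `seed` parameters, the zero fill value, and the 4-channel toy feature are all illustrative assumptions; only the count of 50 temporal frames comes from the Figure 4 caption.

```python
import random


def mask_frames(features, ratio, seed=0):
    """Return a copy of `features` with round(ratio * T) frames zeroed.

    `features` is a list of T frames, each a list of floats.
    Zero-masking and uniform frame selection are assumptions of this sketch.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    num_frames = len(features)
    num_masked = round(ratio * num_frames)
    rng = random.Random(seed)  # local generator keeps the mask reproducible
    masked_idx = set(rng.sample(range(num_frames), num_masked))
    return [
        [0.0] * len(frame) if t in masked_idx else list(frame)
        for t, frame in enumerate(features)
    ]


# Toy visual feature: 50 temporal frames, 4 channels each (channel count is illustrative).
visual = [[1.0] * 4 for _ in range(50)]
for ratio in (0.0, 0.2, 0.6, 1.0):
    corrupted = mask_frames(visual, ratio)
    zeroed = sum(1 for frame in corrupted if not any(frame))
    print(f"ratio={ratio:.1f}: {zeroed} of {len(corrupted)} frames masked")
```

Sweeping `ratio` in this way gives one point per corruption degree, which is the kind of axis a robustness curve like Figure 5 needs. Contiguous masking (occluding a block of consecutive frames) would be an equally reasonable assumption and only changes how `masked_idx` is drawn.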