SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation

Zeyu Ling, Xiaodong Gu, Jiangnan Tang, Changqing Zou

TL;DR

SyncLipMAE targets a token-level, synchronization-aware representation of talking-face video. It introduces three per-frame prompt tokens (identity, vocal motion, ambient motion) and a cross-modal objective that pulls temporally matched vocal-motion and audio tokens together in a shared embedding space. Pretraining combines two-view masked visual reconstruction with a CLIP-style audio–visual contrastive loss and a decorrelation term that disentangles the three factors, using a two-pass decoder conditioned on distinct prompts. The resulting tokens form a unified interface for AV synchronization, facial understanding, visual speech recognition, and visual dubbing, with state-of-the-art results reported across multiple datasets. In practice, a single model can both analyze and generate synchronized talking-face content under audio- or video-driven control.

Abstract

We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Our approach couples masked visual modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame: identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio-visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio-visual stream synchronization; (ii) facial emotion and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we enable indistinguishable audio- or video-driven control within a single model. Across four task families that require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.
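
The contrastive objective described in the abstract can be illustrated with a generic symmetric InfoNCE (CLIP-style) loss over per-frame tokens. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `symmetric_info_nce`, the temperature value, and the use of all other frames in the same clip as misaligned negatives are illustrative choices that the abstract does not specify.

```python
import numpy as np


def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(vocal_tokens, audio_tokens, temperature=0.07):
    """Generic CLIP-style loss (assumed form, not taken from the paper).

    vocal_tokens, audio_tokens: arrays of shape (T, D), where row t of each
    array comes from the same time step. Pair (t, t) is the positive; every
    pair (t, s) with s != t serves as a misaligned negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = vocal_tokens / np.linalg.norm(vocal_tokens, axis=1, keepdims=True)
    a = audio_tokens / np.linalg.norm(audio_tokens, axis=1, keepdims=True)

    logits = v @ a.T / temperature  # (T, T); the diagonal holds aligned pairs
    idx = np.arange(len(v))

    # Cross-entropy in both directions: visual-to-audio and audio-to-visual.
    loss_v2a = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_a2v = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2a + loss_a2v)


# Toy check with synthetic tokens: shifting one stream by a frame
# turns every positive pair into a misaligned one.
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
aligned = audio + 0.01 * rng.normal(size=(8, 16))
shifted = np.roll(aligned, 1, axis=0)
print(symmetric_info_nce(aligned, audio), symmetric_info_nce(shifted, audio))
```

In this toy setup the loss is lower when the two streams are time-aligned than when one is shifted by a frame, which is the property that makes such an objective usable as a token-level synchronization signal.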

Paper Structure

This paper contains 43 sections, 19 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Panel (a) schematically illustrates the core components of SyncLipMAE and the computational pipeline used during pretraining, while panel (b) shows its adaptation to downstream tasks.
  • Figure 2: Qualitative visual comparison: LatentSync vs. our approach.