UniSync: A Unified Framework for Audio-Visual Synchronization

Tao Feng, Yifan Xie, Xun Guan, Jiyuan Song, Zhou Liu, Fei Ma, Fei Yu

TL;DR

UniSync tackles precise audio-visual lip synchronization by unifying diverse audio and visual representations into a shared embedding space. It employs a dual-stream architecture to map representations to embeddings and computes the synchronization probability $p_{\text{sync}} = \cos(\operatorname{ReLU}(a), \operatorname{ReLU}(v))$, with a margin-based contrastive loss and cross-speaker negatives to enforce robust separation. On LRS2 and CN-CVS, UniSync delivers state-of-the-art lip-sync accuracy (e.g., $94.27\%$ with HuBERT inputs) and enhances synchronization quality when integrated into talking-face generators like Wav2Lip and GeneFace. The approach demonstrates strong versatility across representation types and practical utility for both real-world and AI-generated content.
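
The synchronization score in the TL;DR can be illustrated with a few lines of plain Python. This is a minimal sketch, not the authors' code: the function names `relu` and `sync_probability` and the `eps` guard against zero-norm vectors are assumptions made here for illustration. Because ReLU outputs are non-negative, the cosine of two rectified vectors lies in $[0, 1]$, which is what lets it be read as a probability.

```python
import math
from typing import List, Sequence


def relu(x: Sequence[float]) -> List[float]:
    """Element-wise ReLU: negative entries are clamped to zero."""
    return [max(0.0, xi) for xi in x]


def sync_probability(a: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity between ReLU(a) and ReLU(v).

    `a` and `v` stand for the audio and visual embeddings; they must have
    the same length. `eps` (an assumption of this sketch) avoids division
    by zero when either rectified vector is all zeros.
    """
    if len(a) != len(v):
        raise ValueError("embeddings must have the same dimension")
    ra, rv = relu(a), relu(v)
    dot = sum(x * y for x, y in zip(ra, rv))
    norm_a = math.sqrt(sum(x * x for x in ra))
    norm_v = math.sqrt(sum(y * y for y in rv))
    return dot / max(norm_a * norm_v, eps)
```

Identical embeddings score 1, embeddings with no shared active dimension score 0, and everything else falls in between.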

Abstract

Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
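
The abstract does not spell out the margin-based loss component, so the following sketch substitutes the classic pairwise contrastive loss with a margin, applied to the distance $d = 1 - p_{\text{sync}}$. It is an illustration of how a margin term separates synchronized from unsynchronized pairs, not UniSync's actual objective; the function name and the default `margin` value are hypothetical.

```python
def margin_contrastive_loss(p_sync: float, is_synced: bool, margin: float = 0.5) -> float:
    """Classic pairwise contrastive loss with a margin (assumed form).

    Synced pairs are pulled together: loss = d^2.
    Unsynced pairs are pushed apart until their distance exceeds the
    margin: loss = max(0, margin - d)^2.
    Here d = 1 - p_sync, so a high similarity means a small distance.
    """
    d = 1.0 - p_sync
    if is_synced:
        return d * d
    return max(0.0, margin - d) ** 2
```

Under this form, an unsynchronized pair stops contributing once its similarity falls below `1 - margin`, so training effort concentrates on hard negatives that still look synchronized.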

Paper Structure

This paper contains 18 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Four most commonly used visual representation methods in speech video analysis. From top to bottom: RGB images, face parsing map, facial landmarks, and 3DMM.
  • Figure 2: The architecture of UniSync. For a given video segment, the process begins with the extraction of feature vectors using selected visual and audio representation methods, yielding $v$ and $a$ respectively. These vectors then undergo initial refinement through specialized preprocessing layers ($\varepsilon_{1}$ for visual data and $\varepsilon_{3}$ for audio data). Following an average pooling operation to standardize dimensions, unified feature extraction layers ($\varepsilon_{2}$ for visual data and $\varepsilon_{4}$ for audio data) generate the final embeddings $v'$ and $a'$. The degree of synchronization between visual and audio content is subsequently determined by computing the cosine similarity ($p_{\text{sync}}$) between these refined embeddings.
  • Figure 3: Audio-visual contrastive learning samples: sync positives (matching audio and visual), same-speaker negatives (with or without temporal overlap), and cross-speaker negatives (mismatched speaker identity).
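
The four sample types named in the Figure 3 caption can be told apart with a small classifier over window pairs. This is a hedged sketch under stated assumptions: the `Window` tuple layout, the `speaker_of` mapping from clip ID to speaker, and the label strings are all hypothetical, and the paper's actual sampling procedure may differ.

```python
from typing import Dict, Tuple

# (clip_id, start_frame, end_frame), with end_frame exclusive. Assumed layout.
Window = Tuple[str, int, int]


def overlap(a: Window, b: Window) -> int:
    """Number of frames shared by two windows of the same clip, else 0."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def classify_pair(visual: Window, audio: Window, speaker_of: Dict[str, str]) -> str:
    """Label a (visual, audio) window pair with one of the four sample types."""
    if speaker_of[visual[0]] != speaker_of[audio[0]]:
        return "cross_speaker_negative"
    if visual == audio:
        return "sync_positive"
    if overlap(visual, audio) > 0:
        return "same_speaker_negative_overlap"
    return "same_speaker_negative_no_overlap"


if __name__ == "__main__":
    speakers = {"clip_a": "spk1", "clip_b": "spk2"}
    v = ("clip_a", 0, 5)
    for a in [("clip_a", 0, 5), ("clip_a", 2, 7), ("clip_a", 10, 15), ("clip_b", 0, 5)]:
        print(a, classify_pair(v, a, speakers))
```

Same-speaker negatives with temporal overlap are the hardest case, since the audio is only slightly shifted against the video; cross-speaker negatives differ in speaker identity as well as timing.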