UniSync: A Unified Framework for Audio-Visual Synchronization
Tao Feng, Yifan Xie, Xun Guan, Jiyuan Song, Zhou Liu, Fei Ma, Fei Yu
TL;DR
UniSync tackles precise audio-visual lip synchronization by unifying diverse audio and visual representations into a shared embedding space. It employs a dual-stream architecture that maps the audio and visual representations to embeddings $a$ and $v$, and computes the synchronization probability as the cosine similarity of the rectified embeddings, $p_{sync} = \cos(\operatorname{ReLU}(a), \operatorname{ReLU}(v))$. Training uses a margin-based contrastive loss with cross-speaker negatives to enforce robust separation between synchronized and unsynchronized pairs. On LRS2 and CN-CVS, UniSync delivers state-of-the-art lip-sync accuracy (e.g., $94.27\%$ with HuBERT inputs) and enhances synchronization quality when integrated into talking-face generators like Wav2Lip and GeneFace. The approach works across representation types and is useful for both real-world and AI-generated content.
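As a minimal sketch of the scoring formula above, the snippet below computes $p_{sync}$ for a pair of embedding vectors in plain Python. The function names, the list-of-floats input format, and the `eps` guard against all-zero vectors are assumptions made for illustration, not details taken from the paper. Because ReLU leaves only non-negative components, the resulting cosine similarity lies in $[0, 1]$, which is what allows it to be read as a probability.

```python
import math

def relu(x):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, xi) for xi in x]

def sync_probability(a, v, eps=1e-8):
    """Cosine similarity of the rectified audio and visual embeddings.

    `a` and `v` are equal-length lists of floats (hypothetical format).
    `eps` is an assumed guard so that an all-zero vector yields 0.0
    instead of a division by zero.
    """
    a_pos, v_pos = relu(a), relu(v)
    dot = sum(x * y for x, y in zip(a_pos, v_pos))
    norm_a = math.sqrt(sum(x * x for x in a_pos))
    norm_v = math.sqrt(sum(y * y for y in v_pos))
    return dot / max(norm_a * norm_v, eps)

# Illustrative toy vectors, not real embeddings.
aligned = sync_probability([1.0, 2.0, -3.0], [1.0, 2.0, -5.0])
orthogonal = sync_probability([1.0, 0.0], [0.0, 1.0])
```

In this toy example the negative components are zeroed out before comparison, so the first pair scores close to 1 and the second pair scores 0.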
Abstract
Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
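The abstract does not give the exact form of the margin-based loss or of the negative sampling, so the following is only a hedged sketch of how the two ideas could fit together. It assumes a simple hinge-style objective (synchronized pairs are pushed toward a score of 1, unsynchronized pairs are penalized only when their score exceeds a margin) and a hypothetical clip format of `(speaker_id, audio_emb, visual_emb)`. The function names and the default `margin=0.5` are illustrative choices, not values from the paper.

```python
def margin_contrastive_loss(p_sync, is_synced, margin=0.5):
    """Assumed hinge-style loss on a synchronization score in [0, 1].

    Synchronized pairs are pulled toward 1; unsynchronized pairs
    incur a penalty only when their score rises above `margin`.
    """
    if is_synced:
        return 1.0 - p_sync
    return max(0.0, p_sync - margin)

def cross_speaker_negatives(clips):
    """Pair each clip's audio with the visuals of a different speaker.

    `clips` is a list of (speaker_id, audio_emb, visual_emb) tuples,
    a hypothetical format used only for this sketch.
    """
    return [
        (a_i, v_j)
        for (s_i, a_i, _) in clips
        for (s_j, _, v_j) in clips
        if s_i != s_j
    ]

# Toy batch: two clips from speaker "s1", one from speaker "s2".
clips = [("s1", "a1", "v1"), ("s1", "a2", "v2"), ("s2", "a3", "v3")]
negatives = cross_speaker_negatives(clips)
```

Under these assumptions, an unsynchronized pair scoring below the margin contributes no loss, so training effort concentrates on the negatives that are still confusable with synchronized pairs.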
