UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

Ruidi Fan, Yang Zhou, Siyuan Wang, Tian Yu, Yutong Jiang, Xusheng Liu

TL;DR

UniSync is a unified framework for high-fidelity lip synchronization in diverse scenarios. It uses a mask-free, pose-anchored training strategy to preserve head motion and eliminate color artifacts in synthesis, and mask-based, blending-consistent inference to ensure structural precision and smooth blending.

Abstract

Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods suffer from global misalignment of background texture. Furthermore, most methods break down in diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework for high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free, pose-anchored training strategy to preserve head motion and eliminate color artifacts in synthesis, while employing mask-based, blending-consistent inference to ensure structural precision and smooth blending. Notably, fine-tuning on a compact but diverse set of videos gives our model strong domain adaptability, allowing it to handle complex corner cases effectively. We also introduce the RealWorld-LipSync benchmark, which evaluates models under real-world demands and covers diverse application scenarios including both human faces and stylized avatars. Extensive experiments demonstrate that UniSync significantly outperforms state-of-the-art methods, advancing the field towards truly generalizable and production-ready lip synchronization.
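
The abstract names mask-based blending at inference but does not spell out the procedure. As a rough illustration of the general idea only, the following sketch pastes a generated face region back into the original frame through a Gaussian-feathered mask, so the seam fades smoothly instead of showing a hard edge. The function names (gaussian_kernel, feather_mask, blend) and the sigma parameter are hypothetical and chosen for this example; this is a generic feathered alpha blend, not UniSync's actual inference pipeline.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Return a normalized 1-D Gaussian kernel covering about +/- 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def feather_mask(mask: np.ndarray, sigma: float) -> np.ndarray:
    """Soften a binary (H, W) mask with a separable Gaussian blur."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    # Edge-pad so the blur does not darken the mask at the image border.
    padded = np.pad(mask.astype(np.float64), radius, mode="edge")
    # Blur along rows, then along columns; "valid" trims the padding off.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    both = np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")
    return np.clip(both, 0.0, 1.0)


def blend(original: np.ndarray, generated: np.ndarray,
          mask: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Alpha-blend `generated` into `original` inside the feathered mask.

    original, generated: float arrays of shape (H, W, C).
    mask: binary array of shape (H, W), 1 where generated pixels are wanted.
    """
    alpha = feather_mask(mask, sigma)[..., None]  # (H, W, 1) for broadcasting
    return alpha * generated + (1.0 - alpha) * original


if __name__ == "__main__":
    # Toy frame: black original, white "generated" frame, rectangular mouth mask.
    original = np.zeros((16, 16, 3))
    generated = np.ones((16, 16, 3))
    mask = np.zeros((16, 16))
    mask[8:14, 4:12] = 1.0
    out = blend(original, generated, mask)
    print(out[:, 8, 0].round(2))  # values rise smoothly towards the mask
```

A larger sigma widens the transition band around the mask, trading a sharper region boundary for a less visible seam.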

Paper Structure

This paper contains 20 sections, 12 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison with OmniSync on challenging real-world scenarios. OmniSync shows degraded quality under low-light conditions and occlusions, and fails completely on stylized cartoons, producing no lip modifications. Our method handles all cases robustly, with accurate synchronization and high visual fidelity.
  • Figure 2: Training framework of the proposed method. The Pose-Anchored Fidelity Strategy (PAFS) enforces a direct mapping between pose variation and facial motion. LoRA-based fine-tuning lets the model adapt efficiently to pose-anchored, mask-free training.
  • Figure 4: Quantitative evaluation results compared with recent SOTA methods MuseTalk, LatentSync, and OmniSync. Our method achieves the most accurate lip synchronization and demonstrates superior robustness across diverse real-world production scenarios.
  • Figure 5: Ablation study of PAFS, TALI, and Gaussian smoothing.
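
The Figure 2 caption mentions LoRA-based fine-tuning. For readers unfamiliar with the technique, here is a minimal NumPy sketch of the standard LoRA formulation: a frozen weight matrix W is augmented with a trainable low-rank update, scaled by alpha / rank. The class name LoRALinear, the rank, alpha, and the initialization scale are illustrative assumptions; the listing does not state which layers or hyperparameters UniSync uses.

```python
import numpy as np


class LoRALinear:
    """Linear layer with a frozen weight and a low-rank trainable update.

    Computes y = x W^T + (alpha / rank) * x A^T B^T, where only A and B
    would be updated during fine-tuning.
    """

    def __init__(self, weight: np.ndarray, rank: int = 4,
                 alpha: float = 8.0, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen, (out, in)
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (batch, in_dim).
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self) -> int:
        return self.A.size + self.B.size


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(64, 128))
    layer = LoRALinear(W, rank=4)
    x = rng.normal(size=(2, 128))
    # With B initialized to zero, the adapted layer matches the frozen one.
    print(np.allclose(layer(x), x @ W.T))
    print(layer.trainable_params(), "trainable vs", W.size, "frozen")
```

Because B starts at zero, fine-tuning begins from the pretrained model's behavior, and the number of trainable parameters grows with rank × (in_dim + out_dim) rather than in_dim × out_dim.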