StyleLipSync: Style-based Personalized Lip-sync Video Generation
Taekyung Ki, Dongchan Min
TL;DR
StyleLipSync addresses identity-agnostic lip-sync video generation from arbitrary audio. It combines a StyleGAN-based decoder with a pose-aware masking strategy that preserves lip fidelity under changing head pose. Style-aware Masked Fusion (SaMF) provides spatial fidelity and Moving-average Latent Smoothing (MaLS) provides temporal coherence, and together they enable zero-shot lip-sync. For unseen faces, few-shot adaptation with a sync regularizer enhances person-specific appearance while preserving lip-sync generalization across audio. On VoxCeleb2 and HDTF, the method achieves state-of-the-art lip-sync metrics and competitive or superior image quality, and ablations confirm the contributions of pose-aware masking, SaMF, and MaLS. Ethical considerations and potential misuse are discussed, with proposed mitigations such as watermarks.
Abstract
In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronized video from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also enforce video consistency with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking, which uses a 3D parametric mesh predictor to position the mask dynamically for each frame and thereby improves naturalness across frames. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual information. Extensive experiments demonstrate that our model generates accurate lip-sync videos even in the zero-shot setting and, through the proposed adaptation method, enhances the characteristics of an unseen face using only a few seconds of target video.
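The TL;DR names Moving-average Latent Smoothing (MaLS) as the component responsible for temporal coherence, but this summary does not give its exact formulation. As a rough illustration of the general idea only, the sketch below applies a centered moving average to a sequence of per-frame latent vectors. The function name `moving_average_latents`, the window size, and the edge-replication padding are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def moving_average_latents(latents, window=5):
    """Smooth a (T, D) sequence of latent vectors with a centered moving average.

    Hypothetical helper: the window size and edge handling are assumptions,
    not details taken from the paper.
    """
    latents = np.asarray(latents, dtype=np.float64)
    if latents.ndim != 2:
        raise ValueError("latents must have shape (T, D)")
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")

    half = window // 2
    # Replicate the first and last frames so the output keeps T frames.
    padded = np.pad(latents, ((half, half), (0, 0)), mode="edge")
    # Average each frame's latent with its temporal neighbours.
    return np.stack(
        [padded[t:t + window].mean(axis=0) for t in range(latents.shape[0])]
    )
```

Because averaging is a linear operation on the latent sequence, a smoother of this kind leaves a constant sequence unchanged and only damps frame-to-frame jitter; a larger window trades responsiveness of the lip motion for smoothness.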
