StyleLipSync: Style-based Personalized Lip-sync Video Generation

Taekyung Ki; Dongchan Min

TL;DR

StyleLipSync tackles identity-agnostic lip-sync video generation from arbitrary audio by leveraging a StyleGAN-based decoder and a pose-aware masking strategy that preserves lip fidelity under dynamic head pose. It introduces Style-aware Masked Fusion (SaMF) for spatial fidelity and Moving-average based Latent Smoothing (MaLS) for temporal coherence, while enabling zero-shot lip-sync. For unseen faces, a few-shot adaptation with a sync regularizer preserves lip-sync generalization and enhances person-specific appearance without sacrificing lip-sync accuracy. On VoxCeleb2 and HDTF, the method achieves state-of-the-art lip-sync metrics and competitive or superior image quality, and ablations confirm the contributions of pose-aware masking, SaMF, and MaLS. Ethical considerations and potential misuse are discussed, with proposed mitigations such as watermarks.

Abstract

In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing video from arbitrary audio. To generate a video of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also design video consistency with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking, which uses a 3D parametric mesh predictor frame by frame to dynamically locate the mask and improve naturalness across frames. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing the person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even in the zero-shot setting and can enhance the characteristics of an unseen face using a few seconds of target video through the proposed adaptation method.

Paper Structure (18 sections, 10 equations, 7 figures, 4 tables)

Introduction
Related Works
Lip-sync Video Generation
GAN Prior
Personalization
Method
Pose-aware Masking
Decoder
Encoders
Training Objective
Unseen Face Adaptation
Experiments
Dataset
Implementation Details
Evaluation
...and 3 more sections

Figures (7)

Figure 1: The StyleLipSync framework. We leverage a 3D parametric mesh predictor [bfm, mediapipe] to obtain pose-aware masked frames $X_{1:T}$, which inherit the facial pose of the input frames. The face encoder $\mathbf{E}_{face}$ maps $X_{1:T}$ into 2D spatial features, which are then fed into the decoder $\mathbf{G}$ through style-aware masked fusion ($\text{SaMF}$). A single reference image $X_{ref}$ and audio segments $A_{1:T}$ are mapped into the latent space, followed by Moving-average based Latent Smoothing ($\text{MaLS}$). This module outputs smooth video latent codes $\tilde{w}_{1:T} \subseteq \mathcal{W}+$ that represent temporally consistent lip movement. Guided by the $\text{SaMF}$ modules and the smooth video latent codes $\tilde{w}_{1:T}$, StyleLipSync can generate temporally consistent lip-synced videos.
Figure 2: Illustration of pose-aware masking. The expression parameter $\delta \in \mathbb{R}^{64}$ and the pose parameters $\tau \in \mathbb{R}^{3}$ and $\gamma \in \text{SO}(3)$ are used to compute the natural mask.
Figure 3: Illustration of the decoder block. The encoded feature $\mathbf{E}_{face}^l(X_t)$ is injected into the $l$-th decoder block through Style-aware Masked Fusion ($\text{SaMF}$). Note that only the convolutions in $\text{SaMF}$ are trainable, while all other layers are frozen during the training phase.
Figure 4: Adaptation for an unseen face. We slightly fine-tune the decoder $\mathbf{G}_{\theta}$ with the proposed sync regularizer $\mathcal{R}_{sync}$, while freezing the weights of all encoders. The face encoder $\mathbf{E}_{face}$ and the $\text{SaMF}$ modules are omitted here for simplicity.
Figure 5: Comparison with state-of-the-art methods. The different field of view comes from the pre-processing strategy of each model.
...and 2 more figures
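
The Figure 1 caption describes MaLS as a moving-average smoothing of the per-frame latent codes that yields $\tilde{w}_{1:T}$. As a minimal sketch of that idea (not the authors' implementation), the Python below applies a centered moving average over time to a sequence of latent vectors. The function name `moving_average_smooth`, the `radius` parameter, the flattened list-of-floats layout, and the clipping of the window at the sequence edges are all assumptions made for illustration; the actual window size and boundary handling are not given in this summary.

```python
from typing import List

# Hypothetical layout: each latent code is a flat list of floats.
Latent = List[float]


def moving_average_smooth(latents: List[Latent], radius: int = 1) -> List[Latent]:
    """Centered moving average over time.

    Frame t is replaced by the mean of frames t-radius .. t+radius.
    Near the start and end, the window is clipped to the valid range
    (an assumption; other boundary rules are possible).
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")

    num_frames = len(latents)
    smoothed: List[Latent] = []
    for t in range(num_frames):
        start = max(0, t - radius)
        stop = min(num_frames, t + radius + 1)
        window = latents[start:stop]
        dim = len(window[0])
        # Average each latent dimension independently across the window.
        smoothed.append(
            [sum(code[d] for code in window) / len(window) for d in range(dim)]
        )
    return smoothed


if __name__ == "__main__":
    # Toy 2-dimensional "latent codes" that jitter from frame to frame.
    codes = [[0.0, 0.0], [3.0, 6.0], [0.0, 0.0], [3.0, 6.0]]
    for frame in moving_average_smooth(codes, radius=1):
        print(frame)
```

Because averaging is linear in the latent codes, a sketch like this is at least consistent with the abstract's remark that video consistency can be designed with a linear transformation. A larger `radius` smooths more strongly and would attenuate rapid frame-to-frame changes.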
