Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Runyi Yu, Tianyu He, Ailing Zhang, Yuchi Wang, Junliang Guo, Xu Tan, Chang Liu, Jie Chen, Jiang Bian

TL;DR

This work tackles lip-sync editing in talking videos while preserving personal identity and visual details. It introduces MyTalk, a two-stage framework that first uses a speech-driven diffusion model to generate facial landmarks conditioned on speech and identity, then synthesizes appearance via a motion-conditioned generator that separately encodes lip and non-lip appearance along with motion, fused by a FusionNet. Key contributions include a landmark-based identity loss for motion generation, a multi-encoder appearance model with a learned fusion mechanism, and training on a large, diverse dataset enabling generalization to unseen identities and controllable editing of appearance and emotion. The approach achieves superior lip-sync accuracy and visual fidelity compared with one-stage baselines, offering practical benefits for high-fidelity, identity-preserving video editing in AI-generated content.

Abstract

We aim to edit the lip movements in a talking video according to the given speech while preserving personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis. Current solutions handle the two sub-problems within a single generative model, resulting in a challenging trade-off between lip-sync quality and visual detail preservation. Instead, we propose to disentangle motion and appearance, and then generate them one after the other with a speech-to-motion diffusion model and a motion-conditioned appearance generation model. However, challenges remain in each stage, such as motion-aware identity preservation in (1) and visual detail preservation in (2). Therefore, to preserve personal identity, we adopt landmarks to represent the motion and further employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders to encode the lip appearance, the non-lip appearance, and the motion, and then integrate them with a learned fusion module. We train MyTalk on a large-scale and diverse dataset. Experiments show that our method generalizes well to unknown, even out-of-domain, persons in terms of both lip sync and visual detail preservation. We encourage readers to watch the videos on our project page (https://Ingrid789.github.io/MyTalk/).
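
The two-stage decomposition described in the abstract can be pictured as two functions chained through a landmark sequence. The sketch below is only an illustration of that data flow under stated assumptions: the class name `TwoStageLipSync`, both callables' signatures, and the toy stand-ins are hypothetical and are not MyTalk's models or API. In the paper, stage 1 is a diffusion model and stage 2 is a learned generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Landmarks = List[Tuple[float, float]]  # 2D facial landmarks for one frame
Frame = List[List[float]]              # placeholder for one video frame


@dataclass
class TwoStageLipSync:
    """Hypothetical wrapper: stage 1 predicts motion, stage 2 renders appearance."""
    speech_to_motion: Callable[[Sequence[float], Landmarks], List[Landmarks]]
    motion_to_appearance: Callable[[Landmarks, Frame], Frame]

    def edit(self, speech: Sequence[float], identity: Landmarks,
             frames: List[Frame]) -> List[Frame]:
        # Stage 1: speech (plus identity landmarks) -> one landmark set per frame.
        motion = self.speech_to_motion(speech, identity)
        if len(motion) != len(frames):
            raise ValueError("stage 1 must return one landmark set per frame")
        # Stage 2: render each frame from its landmarks and the source frame.
        return [self.motion_to_appearance(lm, f) for lm, f in zip(motion, frames)]


# Toy stand-ins so the sketch runs; neither is a real model.
def toy_motion(speech, identity):
    # Shift the last ("lip") landmark's y in proportion to speech loudness.
    return [identity[:-1] + [(identity[-1][0], identity[-1][1] + abs(s))]
            for s in speech]


def toy_appearance(landmarks, frame):
    # Copy the source frame, then write the lip position into one pixel only.
    out = [row[:] for row in frame]
    out[-1][-1] = landmarks[-1][1]
    return out


pipeline = TwoStageLipSync(toy_motion, toy_appearance)
identity = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
frames = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
edited = pipeline.edit([0.0, 0.5, -0.25], identity, frames)
```

The point of the split is that the speech signal only ever reaches the renderer through the landmarks, so appearance details in the source frames cannot be overwritten by the audio-driven stage directly.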

Paper Structure

This paper contains 41 sections, 9 equations, 9 figures, and 4 tables.

Figures (9)

  • Figure 1: Samples of the original talking videos and the ones generated by our method according to the given speech input, showing that our method performs well in both lip sync and visual detail preservation. It also generalizes well to unknown and out-of-domain characters (e.g., the bottom case), enabling seamless lip sync for AI-generated videos (e.g., the middle and bottom cases).
  • Figure 2: A brief illustration of previous methods and ours. We compare the edited talking videos generated by different models to evaluate their visual detail preservation and lip-sync quality. (a) gives a straightforward framework comparison. (b) evaluates the different methods on visual detail preservation and lip-sync quality. Previous one-stage methods all struggle to preserve visual details and ensure lip-sync quality at the same time, while our proposed motion-appearance disentangled two-stage method achieves excellent results in both aspects.
  • Figure 3: Our proposed MyTalk adopts a motion-appearance disentangled two-stage framework to realize talking video lip sync. (a) In the first stage, we adopt a speech-driven motion generation model that uses a diffusion model to generate motion (i.e., landmark) sequences from the input speech. (b) To better preserve the motion identity, we design an identity extractor and the corresponding identity loss in the motion generation model. (c) In the second stage, we use separate encoders to encode the motion-agnostic lip appearance, the non-lip appearance, and the generated motion. The encoded representations are fused with a FusionNet and decoded into the output video.
  • Figure 4: Examples of controllable generation. Benefiting from our disentanglement and identity preservation designs, MyTalk shows novel properties in controlling appearance and emotion while editing the talking video.
  • Figure 5: Qualitative comparison on paired video-speech data.
  • ...and 4 more figures
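
The identity extractor and identity loss mentioned in the Figure 3(b) caption are not specified on this page, so the following is only a minimal sketch of what a landmark-based identity loss could look like. It assumes a hand-written stand-in extractor (time-averaged landmark shape, centred and scale-normalised) and a mean squared error between the resulting features. The function names `identity_features` and `identity_loss` are hypothetical, and the paper's extractor may well be a learned network with a different loss.

```python
import math
from typing import List, Tuple

Landmarks = List[Tuple[float, float]]  # 2D facial landmarks for one frame


def identity_features(seq: List[Landmarks]) -> List[float]:
    """Stand-in identity extractor (an assumption, not the paper's module)."""
    n_pts = len(seq[0])
    # Average each landmark over time so per-frame lip motion is smoothed out.
    mean = [(sum(f[i][0] for f in seq) / len(seq),
             sum(f[i][1] for f in seq) / len(seq)) for i in range(n_pts)]
    # Remove translation by centring on the mean point.
    cx = sum(p[0] for p in mean) / n_pts
    cy = sum(p[1] for p in mean) / n_pts
    centred = [(x - cx, y - cy) for x, y in mean]
    # Remove scale by normalising to unit norm (guard against a zero norm).
    scale = math.sqrt(sum(x * x + y * y for x, y in centred)) or 1.0
    return [v / scale for p in centred for v in p]


def identity_loss(generated: List[Landmarks], reference: List[Landmarks]) -> float:
    """Mean squared distance between the two sequences' identity features."""
    a, b = identity_features(generated), identity_features(reference)
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Usage: the same face shape, shifted and scaled, scores (near) zero;
# a different face shape scores higher.
reference = [[(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)],
             [(0.0, 0.0), (2.0, 0.0), (1.0, 1.2)]]
same_face = [[(2 * x + 5, 2 * y + 5) for x, y in f] for f in reference]
other_face = [[(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
              [(0.0, 0.0), (2.0, 0.0), (1.0, 3.2)]]
loss_same = identity_loss(same_face, reference)
loss_other = identity_loss(other_face, reference)
```

Averaging over time before comparing is the design choice that makes this stand-in "motion-aware": frame-to-frame lip movement is largely averaged out, while persistent differences in face shape remain and are penalised.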