MegActor: Harness the Power of Raw Video for Vivid Portrait Animation

Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan

TL;DR

MegActor animates portraits directly from raw driving video. It mitigates identity leakage with synthetic training data and stabilizes backgrounds by encoding the reference background with CLIP. Its diffusion-based architecture combines a ReferenceNet, a DrivenEncoder, a Temporal Layer, and an ImageEncoder to integrate foreground/background cues and maintain temporal coherence. Trained in two stages on public datasets without relying on control nets, it achieves results comparable to commercial models and demonstrates strong cross-identity generalization. The work gives the open-source community a high-fidelity, video-driven portrait animation model with stable backgrounds and reduced identity leakage.

Abstract

Although raw driving videos contain richer information on facial expressions than intermediate representations such as landmarks, they are seldom the subject of research in portrait animation. This is due to two challenges inherent in portrait animation driven by raw videos: 1) significant identity leakage; 2) irrelevant background and facial details, such as wrinkles, that degrade performance. To harness the power of raw videos for vivid portrait animation, we propose a pioneering conditional diffusion model named MegActor. First, we introduce a synthetic data generation framework for creating videos with consistent motion and expressions but inconsistent IDs to mitigate the issue of ID leakage. Second, we segment the foreground and background of the reference image and employ CLIP to encode the background details. This encoded information is then integrated into the network via a text embedding module, thereby ensuring the stability of the background. Finally, we transfer the style of the reference image's appearance onto the driving video to eliminate the influence of facial details in the driving video. Our final model was trained solely on public datasets and achieves results comparable to commercial models. We hope this will help the open-source community. The code is available at https://github.com/megvii-research/MegFaceAnimate.
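
The foreground/background split mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary mask is already available from some segmentation model (the abstract does not say which one), and it uses a tiny grayscale image stored as nested lists. The function name `split_by_mask` and the fill value are hypothetical and are not taken from the MegActor code.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, kept tiny for illustration
Mask = List[List[int]]   # 1 = foreground (portrait), 0 = background


def split_by_mask(image: Image, mask: Mask, fill: int = 0) -> Tuple[Image, Image]:
    """Split an image into foreground-only and background-only copies.

    Hypothetical helper: pixels outside the kept region are set to `fill`.
    """
    if len(image) != len(mask) or any(
        len(row) != len(mask_row) for row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same height and width")

    foreground = [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]
    background = [
        [fill if keep else pixel for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]
    return foreground, background


# Toy 3x3 "reference image" and a cross-shaped foreground mask (made-up values).
image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
mask = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]

fg, bg = split_by_mask(image, mask)
```

In a pipeline like the one the abstract describes, only the background-only copy would be passed to the CLIP image encoder, so the resulting embedding carries background detail rather than identity.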

Paper Structure (19 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Given a raw driving video (left column), MegActor can synthesize captivating and expressive animations (right column), encompassing the head pose variations and detailed facial expressions present in the input driving video with multiple reference portraits.
  • Figure 2: Overview of the proposed method. The raw video frames are processed by AI face-swapping and stylization to modify the character ID, then data augmentation methods such as scaling and aspect ratio adjustment are applied, and finally, all parts except the face are masked out to obtain the driving video, which is then fed into the DrivenEncoder. The encoding results of the DrivenEncoder are concatenated along the channel dimension with latent noise, the latent code of the reference image, and the foreground and background masks, and then fed into the Denoising UNet. MegActor's ReferenceNet extracts identity and background information of the character and injects this information into the Denoising UNet through cross-attention. CLIP encodes the reference image background, and this embedding replaces the text embedding injected into the ReferenceNet and Denoising UNet.
  • Figure 3: Visualization results. To further demonstrate the generalizability of our approach, we used the official generated results from VASA (Xu et al., 2024) as the driving video (left column) to animate multiple reference images (right column), also taken from VASA. The rich and realistic generated videos, with consistent expressions and head movements, showcase the robustness of our approach.
  • Figure 4: Comparison with the SOTA portrait animation method EMO (Tian et al., 2024). Since EMO has not released its inference code, we selected cases from EMO's official demonstration for comparison. The visualization results show that our method achieves results comparable to EMO's.
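
The channel-wise concatenation described in the Figure 2 caption can be sketched as simple shape bookkeeping. The snippet below only tracks tensor shapes in `(batch, channels, height, width)` order. The channel counts and the 64x64 spatial size are assumptions chosen for illustration and are not values reported in the paper, and `concat_channels` is a hypothetical helper.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int, int, int]  # (batch, channels, height, width)


def concat_channels(shapes: Dict[str, Shape]) -> Shape:
    """Return the shape obtained by concatenating inputs along the channel axis.

    All inputs must agree on batch size, height, and width.
    """
    if not shapes:
        raise ValueError("at least one input is required")

    batch, _, height, width = next(iter(shapes.values()))
    total_channels = 0
    for name, (b, c, h, w) in shapes.items():
        if (b, h, w) != (batch, height, width):
            raise ValueError(f"{name} does not match the other inputs outside the channel axis")
        total_channels += c
    return (batch, total_channels, height, width)


# Inputs named after the caption; every size here is an assumed placeholder.
unet_inputs = {
    "latent_noise": (1, 4, 64, 64),
    "reference_latent": (1, 4, 64, 64),
    "driven_encoder_features": (1, 4, 64, 64),
    "foreground_mask": (1, 1, 64, 64),
    "background_mask": (1, 1, 64, 64),
}

unet_input_shape = concat_channels(unet_inputs)
```

Under these assumed sizes, the Denoising UNet's first convolution would need to accept 14 input channels. Concatenation of this kind requires every conditioning input to be brought to the same spatial resolution as the latent before it is stacked.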