FG-Portrait: 3D Flow Guided Editable Portrait Animation

Yating Xu, Yunqi Miao, Evangelos Ververas, Jiankang Deng, Jifei Song

Abstract

Motion transfer from the driving portrait to the source portrait remains a key challenge in portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence computed directly from parametric 3D head models. To integrate this 3D prior into the diffusion model, we introduce a 3D flow encoding that queries potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method in consistent driving motion transfer as well as faithful source identity preservation.

Paper Structure (20 sections, 8 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Given a source and a driving image, the goal of portrait animation is to synthesize a target image with the source identity and the driving head pose and expression. APD($\downarrow$) and AED($\downarrow$) denote the average pose and expression errors relative to the driving pose and expression, respectively. Compared to other diffusion-based models, e.g., X-Portrait [xie2024x] and Face-Adapter [han2024face], we mimic the driving motion more accurately. Furthermore, we support feed-forward head pose and expression editing at inference time.
  • Figure 2: Visualization of 3D flows. The target portrait is the animated image with the source identity and the driving motion. The target 3D head is the head model of the target portrait, which is pre-computed by assembling the source shape and the driving motion parameters. We select some corresponding points in both 2D and 3D, with red denoting the source position and green denoting the target position. The 3D flows (black lines) correctly reflect the displacement from the target to the source position for each point. The yellow circles mark one example pair of points and the corresponding 3D flow.
  • Figure 3: Our framework. We propose the 3D flow encoding $F_{src \leftarrow tgt}$ as a new motion condition for the ControlNet $G$. Specifically, we first estimate the FLAME models $M_{tgt}$ and $M_{src}$ from $I_{dri}$ and $I_{src}$. We then perform depth-guided sampling to query, for each pixel, the 3D flows in the target space that indicate its displacement back to the source location; these flows are stacked to form the 3D flow encoding. During inference, the driving motion $(\theta_{dri}, \psi_{dri})$ can come from the driving image or from user-defined editing. During training, $I_{dri}$ is sampled from a video of the same person as $I_{src}$, and the training objective is to reconstruct $I_{tgt}$ as $I_{dri}$.
  • Figure 4: Illustration of depth-guided sampling. Arrows show the 3D flows for the selected pixel, marked with a yellow circle. The sampled points in (b) are closer to the actual 3D location of the pixel than those in (a), resulting in 3D flows that are more faithfully aligned with the 2D motion.
  • Figure 5: Qualitative results on self-reenactment. "Source" and "Driving" denote the source and driving image, respectively.
  • ...and 6 more figures
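
The captions for Figures 2 to 4 describe 3D flows as displacements from target to source positions, queried per pixel with the help of depth. The following is a minimal NumPy sketch of that idea and not the authors' implementation. It assumes that the source and target head meshes share vertex ordering, that projection is orthographic (x, y kept, z used as depth), and that a per-pixel depth value is available. The function names `vertex_flows` and `depth_guided_flow` and the parameters `k` and `radius` are hypothetical.

```python
import numpy as np

def vertex_flows(src_vertices, tgt_vertices):
    """Per-vertex 3D flow pointing from each target vertex back to its source position.

    Assumes both meshes list corresponding vertices in the same order.
    """
    return src_vertices - tgt_vertices

def depth_guided_flow(pixel_xy, pixel_depth, tgt_vertices, flows, k=4, radius=0.1):
    """Stack the flows of up to k nearby target vertices whose depth best matches the pixel.

    Assumption: orthographic projection, so a vertex projects to its (x, y) coordinates.
    Returns an array of shape (k, 3), zero-padded when fewer than k candidates exist.
    """
    # Vertices whose projection falls within `radius` of the pixel.
    dist_2d = np.linalg.norm(tgt_vertices[:, :2] - np.asarray(pixel_xy), axis=1)
    candidates = np.flatnonzero(dist_2d <= radius)
    # Rank candidates by how closely their depth matches the pixel depth.
    depth_gap = np.abs(tgt_vertices[candidates, 2] - pixel_depth)
    nearest = candidates[np.argsort(depth_gap)[:k]]
    encoding = np.zeros((k, 3))
    encoding[: len(nearest)] = flows[nearest]
    return encoding

# Toy example: two target vertices project to the same pixel but lie at different
# depths (front and back of the head) and move in opposite directions.
tgt = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
src = np.array([[0.1, 0.0, 1.0], [-0.1, 0.0, -1.0]])
flows = vertex_flows(src, tgt)

# The pixel depth (0.9) is closest to the front vertex, so its flow is selected.
enc = depth_guided_flow([0.0, 0.0], 0.9, tgt, flows, k=1)
```

In the toy example, ignoring depth and averaging both candidates would cancel the two opposite flows, which illustrates why locating the 3D point that actually corresponds to the pixel matters.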