Identity-Preserving Pose-Guided Character Animation via Facial Landmarks Transformation

Lianrui Mu, Xingze Zhou, Wenjie Zheng, Jiangnan Ye, Haoji Hu

TL;DR

The paper tackles identity preservation in pose-guided character animation when driving landmarks misalign with reference facial geometry. It introduces Facial Landmarks Transformation (FLT), a training-free, plug-and-play pipeline based on a 3D Morphable Model that converts 2D landmarks to a 3D face, enforces the reference identity by combining reference shape with driving expressions, and re-renders to produce transformed landmarks for generation. Key contributions include the FLT framework, its applicability as a drop-in tool for existing generation models, and open-source release, validated on two models (AnimateAnyone and ControlNeXt) and two datasets (TikTok and UBC Fashion) showing improved identity preservation and temporal coherence. The approach enables more faithful and consistent pose-guided animations in challenging scenarios, with potential impact on virtual character production and personalized video synthesis, while noting limitations in landmark detection under rapid motion and occlusion and prospects for end-to-end and full-body extensions.

Abstract

Creating realistic pose-guided image-to-video character animations while preserving facial identity remains challenging, especially in complex and dynamic scenarios such as dancing, where precise identity consistency is crucial. Existing methods frequently struggle to maintain facial coherence because the facial landmarks extracted from driving videos, which provide head pose and expression cues, are misaligned with the facial geometry of the reference images. To address this limitation, we introduce the Facial Landmarks Transformation (FLT) method, which leverages a 3D Morphable Model. FLT converts 2D landmarks into a 3D face model, adjusts the 3D face model to align with the reference identity, and then transforms it back into 2D landmarks to guide the image-to-video generation process. This approach ensures accurate alignment with the reference facial geometry, enhancing the consistency between generated videos and reference images. Experimental results demonstrate that FLT effectively preserves facial identity, significantly improving pose-guided character animation models.

Paper Structure

This paper contains 14 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: We propose a facial landmark transformation approach using 3D face reconstruction. Our method aligns the driving image's landmarks with a reference face, significantly improving identity consistency in pose-guided generation, even under large facial geometry differences.
  • Figure 2: Overview of the proposed framework. This pipeline aims to preserve facial identity and consistency in pose-guided video generation. Given a reference image and a driving image, facial landmarks $\mathbf{L}_{\text{ref}}$ and $\mathbf{L}_{\text{drive}}$ are first extracted from both sources. These landmarks are then fitted to a 3D Morphable Model (3DMM) to reconstruct 3D face shapes and capture pose information. Next, the shape PCA coefficients of the reference image $\mathbf{S}_{\text{ref}}$ and the expression blend shape coefficients of the driving image $\mathbf{E}_{\text{drive}}$ are combined to generate a transformed 3D face mesh $\text{Mesh}_{\text{trans}}$, which is re-rendered into 2D so that the reference identity is preserved while the pose and expression dynamics of the driving image are adopted. A landmark detector is applied to the re-rendered face, and the extracted facial landmarks are used as input to guide the video generation model. This approach ensures that the generated video maintains facial consistency and identity throughout dynamic poses and complex motions.
  • Figure 3: We compare the generated images with and without our FLT method when applied to AnimateAnyone (Hu et al., 2024) and ControlNeXt (Peng et al., 2024). The results show that our method effectively preserves the reference image's facial features, even with significant differences in facial contours, highlighting FLT's superior identity-preserving performance.
  • Figure 4: We employ a dataset shuffle procedure when testing FLT performance to evaluate the identity-preserving ability of our method.
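
The coefficient swap described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a standard linear 3DMM (mean shape plus shape and expression bases), uses random toy bases in place of a real face model, and projects chosen landmark vertices orthographically instead of re-rendering the face and running a landmark detector. The function `transform_landmarks` and all of its parameter names are hypothetical.

```python
import numpy as np


def transform_landmarks(mean_shape, shape_basis, expr_basis, s_ref, e_drive,
                        rotation, translation, scale, landmark_idx):
    """Combine reference identity with driving expression and pose.

    Assumes a linear 3DMM: vertices = mean + shape_basis @ s + expr_basis @ e.
    Returns the 2D positions of the selected landmark vertices.
    """
    # Identity coefficients come from the reference image,
    # expression coefficients from the driving frame.
    mesh_trans = mean_shape + shape_basis @ s_ref + expr_basis @ e_drive
    mesh_trans = mesh_trans.reshape(-1, 3)  # (num_vertices, 3)

    # Apply the head pose estimated from the driving frame.
    posed = scale * (mesh_trans @ rotation.T) + translation

    # Orthographic projection (an assumption): drop the depth coordinate.
    return posed[landmark_idx, :2]


# Toy data standing in for a real 3DMM and for fitted coefficients.
rng = np.random.default_rng(0)
n_verts, n_shape, n_expr = 5, 4, 3
mean_shape = rng.normal(size=3 * n_verts)
shape_basis = rng.normal(size=(3 * n_verts, n_shape))
expr_basis = rng.normal(size=(3 * n_verts, n_expr))
s_ref = rng.normal(size=n_shape)      # would be fitted to L_ref
e_drive = rng.normal(size=n_expr)     # would be fitted to L_drive

yaw = np.deg2rad(20.0)                # driving head pose: 20 degree yaw
rotation = np.array([
    [np.cos(yaw), 0.0, np.sin(yaw)],
    [0.0, 1.0, 0.0],
    [-np.sin(yaw), 0.0, np.cos(yaw)],
])
translation = np.array([0.5, -0.2, 0.0])
landmark_idx = np.array([0, 2, 4])    # hypothetical landmark vertex indices

l_trans = transform_landmarks(mean_shape, shape_basis, expr_basis,
                              s_ref, e_drive, rotation, translation,
                              1.0, landmark_idx)
print(l_trans.shape)  # (3, 2)
```

Because only `s_ref` carries identity, the same reference coefficients can be reused for every driving frame while `e_drive` and the pose change per frame, which is consistent with the training-free, plug-and-play use described in the TL;DR.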