FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image

Fei Yin, Mallikarjun B R, Chun-Han Yao, Rafał Mantiuk, Varun Jampani

TL;DR

FaceCraft4D generates an animatable 4D avatar from a single image by combining three priors: a shape prior from 3D-GAN inversion, an image prior that uses depth-guided cross-view warping and diffusion-based texture refinement, and a video prior that provides synchronized multi-view expressions. It introduces Consistent-Inconsistent (COIN) training, which learns a view-consistent base representation (GaussianAvatar) while a lightweight MLP absorbs view-specific inconsistencies. The two-stage pipeline first synthesizes personalized multiview data and then optimizes a 4D representation that can be animated via FLAME parameters. The authors report better shape fidelity, texture quality, and cross-view identity preservation than prior work, along with 360-degree rendering, practical training times, and real-time rendering, which makes the approach relevant to gaming, education, and film.

Abstract

We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages shape, image, and video priors to create full-view, animatable avatars. Our approach first obtains an initial coarse shape through 3D-GAN inversion. It then enhances multiview textures with an image diffusion model, using depth-guided warping signals to enforce cross-view consistency. To handle expression animation, we incorporate a video prior with synchronized driving signals across viewpoints. We further introduce Consistent-Inconsistent (COIN) training to handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves superior quality compared to the prior art while maintaining consistency across different viewpoints and expressions.
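The depth-guided warping mentioned in the abstract can be illustrated with the standard pinhole reprojection: back-project each source pixel using its depth, move the 3D point into the target camera, and project it again. The sketch below is only a minimal illustration under that assumption; the function name `warp_pixels`, the shared intrinsics `K`, and the relative pose `(R, t)` are hypothetical and are not taken from the paper.

```python
import numpy as np

def warp_pixels(uv, depth, K, R, t):
    """Reproject source pixels into a target view using per-pixel depth.

    uv:    (N, 2) pixel coordinates in the source view
    depth: (N,)   depth of each pixel along the source camera's z axis
    K:     (3, 3) pinhole intrinsics (assumed shared by both views)
    R, t:  rotation (3, 3) and translation (3,) from source to target camera
    """
    uv = np.asarray(uv, dtype=float)
    depth = np.asarray(depth, dtype=float)
    # Homogeneous pixel coordinates, one row per pixel.
    pix_h = np.hstack([uv, np.ones((uv.shape[0], 1))])
    # Back-project: X_src = depth * K^{-1} [u, v, 1]^T
    pts_src = (pix_h @ np.linalg.inv(K).T) * depth[:, None]
    # Rigid transform into the target camera frame.
    pts_tgt = pts_src @ R.T + t
    # Project with the same intrinsics and divide by depth.
    proj = pts_tgt @ K.T
    return proj[:, :2] / proj[:, 2:3]

# Usage: a point at the image centre, 2 units away, seen from a camera
# whose frame is offset by 0.2 units along x.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
target_uv = warp_pixels([[50.0, 50.0]], [2.0], K, np.eye(3), np.array([0.2, 0.0, 0.0]))
print(target_uv)  # approximately [[60. 50.]]
```

In a texture-refinement setting, correspondences of this kind would tell a generator which reference pixels a novel view should agree with; how FaceCraft4D feeds them to the diffusion model is not specified in this summary.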

Paper Structure

This paper contains 20 sections, 4 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Given a single image input, our method is capable of generating 4D avatars, producing photorealistic textures and consistent expressions across multiple views. Our approach also demonstrates robust performance on challenging inputs, including cartoon characters or old photographs.
  • Figure 2: Overview of FaceCraft4D: Our approach begins by estimating the shape from a single input image using a shape prior. This shape guides the synthesis of personalized multiview images, providing $360^\circ$ views and varied expressions, with support from both 2D image and video priors. Since the synthesized data often exhibit inconsistencies across views, we propose COIN training for robust 4D optimization. By fitting the model to the multiview data, we obtain our final 4D avatar. All faces in this manuscript come from the public FFHQ dataset [karras2019style].
  • Figure 3: Image Prior: We introduce a cross-view mutual attention mechanism and epipolar constraints to enhance consistency in generated novel views. Our approach aligns reference and target images, maintaining visual coherence across viewpoints.
  • Figure 4: Triplane-based methods (e.g., PanoHead) are sensitive to focal length (scale) and degrade significantly in quality as the focal length decreases. In contrast, our Gaussian-based method maintains robust performance across varying focal lengths.
  • Figure 5: COIN-Training: To address shape and color inconsistencies between the views, we train two 3D representations: a view-consistent base model (GaussianAvatar) and an MLP that encodes the inconsistencies between the views. Letting the MLP absorb these inconsistencies allows us to robustly reconstruct a high-quality 3D base model.
  • ...and 8 more figures
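
The idea in the Figure 5 caption, a shared base model plus a separate component that absorbs per-view inconsistencies, can be illustrated with a deliberately simplified toy problem. The sketch below is not the paper's COIN training: a plain vector stands in for the GaussianAvatar base, per-view offset vectors stand in for the MLP, and the ridge penalty `lam` on the offsets is an assumption introduced here so that shared content is pushed into the base. All names are hypothetical.

```python
import numpy as np

def fit_base_and_residuals(views, lam=1.0, iters=50):
    """Toy split of multiview observations into a shared base and per-view residuals.

    Minimises  sum_v ||y_v - (base + r_v)||^2 + lam * sum_v ||r_v||^2
    by alternating closed-form updates.

    views: (V, D) array, one row of observed values per view
    lam:   penalty on residuals (assumption: keeps shared content in the base)
    """
    views = np.asarray(views, dtype=float)
    base = np.zeros(views.shape[1])
    residuals = np.zeros_like(views)
    for _ in range(iters):
        # With the base fixed, each residual is a shrunken per-view error.
        residuals = (views - base) / (1.0 + lam)
        # With the residuals fixed, the base is the mean of what remains.
        base = (views - residuals).mean(axis=0)
    return base, residuals

# Usage: three "views" of the same two values, each with a small view-specific bias.
views = np.array([[1.0, 2.0],
                  [1.2, 1.9],
                  [0.8, 2.1]])
base, residuals = fit_base_and_residuals(views)
print(base)       # close to the per-column mean of the views
print(residuals)  # per-view deviations, shrunk by 1 / (1 + lam)
```

In this toy setting the base converges to the average of the views while the residuals hold what differs between them, so inconsistent observations do not have to be explained by the shared model alone.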