CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models

Felix Taubner, Ruihang Zhang, Mathieu Tuli, David B. Lindell

TL;DR

CAP4D introduces a morphable multi-view diffusion model (MMDM) that synthesizes hundreds of novel portrait views from an arbitrary number of reference images (1–100) and reconstructs a photoreal 4D avatar via 3D Gaussian splatting. The method uses stochastic input–output conditioning to scale diffusion-based generation to large reference sets, and it conditions generation on signals derived from a 3DMM (head pose, expression, view direction) to keep identity consistent across views and expressions. A UV-aware deformation network predicts expression-driven corrections, enabling realistic wrinkles and hair details within a FLAME-based head representation, while the final 4D avatar supports real-time rendering and reenactment. Self- and cross-reenactment experiments demonstrate state-of-the-art performance with 1, 10, and 100 reference images, and ablations confirm the contributions of the MMDM, stochastic conditioning, and the 4D UV deformation design. CAP4D thus bridges single-image priors and multi-view fidelity to deliver scalable, animatable 4D portrait avatars relevant to advertising, VFX, and telepresence.

Abstract

Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints: for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but their visual fidelity still lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.

Paper Structure

This paper contains 51 sections, 7 equations, 15 figures, 6 tables, 1 algorithm.

Figures (15)

  • Figure 1: We present CAP4D: a method that generates 4D portrait avatars based on an arbitrary number of reference images (e.g., from one to one hundred) and animates them in real time. Our approach uses a morphable multi-view diffusion model to predict novel views with unseen expressions. For each subject, we generate hundreds of such views and train an animatable avatar using a representation based on 3D Gaussian splatting. Our method demonstrates state-of-the-art results for portrait view synthesis from a single image, monocular videos, or multi-view camera setups based on visual quality, identity consistency, 3D structure, and temporal consistency.
  • Figure 2: Overview of CAP4D. (a) The method takes as input an arbitrary number of reference images $\mathbf{I}_\text{ref}$ that are encoded into the latent space of a variational autoencoder [rombach2022high]. An off-the-shelf face tracker estimates a 3DMM, $\mathbf{M}_\text{ref}$, for each reference image, from which we derive conditioning signals that describe camera view direction $\mathbf{V}_\text{ref}$, head pose $\mathbf{P}_\text{ref}$, and expression $\mathbf{E}_\text{ref}$. We associate additional conditioning signals with each input noisy latent image based on the desired generated viewpoints, poses, and expressions. The MMDM generates images through a stochastic input--output conditioning procedure that randomly samples reference images and generated images during each step of the iterative image generation process. (b) The generated and reference images are used with the tracked and sampled 3DMMs to reconstruct a 4D avatar based on a 3D Gaussian splatting representation [kerbl20233d, qian2024gaussianavatars].
  • Figure 3: Self-reenactment. Our approach is more realistic than baseline methods for self-reenactment from a single reference image (row 1), 10 reference images (row 2) and 100 reference images (row 3). The MMDM output (MMDM only) produces the most realistic output at the cost of temporal consistency compared to our reconstructed 4D Avatar (CAP4D).
  • Figure 4: Cross-reenactment. Avatars are reconstructed from a single reference image (col. 1), and their expressions are driven by frames of a driving video (col. 2). The camera moves according to the indicated horizontal (H) and vertical (V) view angle. CAP4D faithfully recovers the driving expression and maintains the likeness of the reference subject from challenging view directions. It generates plausible results in occluded regions based on the reference image (e.g., earrings, row 1) and recovers high-frequency details (freckles, row 1).
  • Figure 5: Extensions. We demonstrate 4D appearance editing and relighting by applying CAP4D to images edited using off-the-shelf models [zhang2024stablemakeup, ponglertnapakorn2023difareli]. We also animate CAP4D avatars with a method that predicts 3DMM expressions from speech [xing2023codetalker] (see supplement).
  • ...and 10 more figures
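
The stochastic input--output conditioning described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the batch sizes, and the `denoise_fn` callback are hypothetical, and the choice to shuffle and partition the generated latents so that each one is updated exactly once per step is an assumption made here so that all latents stay at the same noise level. The only detail taken from the source is that reference images and generated images are randomly sampled at each step of the iterative generation process.

```python
import random

def stochastic_io_step(num_ref, num_gen, ref_per_batch, gen_per_batch, rng):
    """Return one step's batches as (reference indices, generated indices) pairs.

    Assumption: generated latents are shuffled and partitioned, so every
    latent appears in exactly one batch per denoising step.
    """
    order = list(range(num_gen))
    rng.shuffle(order)
    batches = []
    for start in range(0, num_gen, gen_per_batch):
        gen_idx = order[start:start + gen_per_batch]
        # Each batch sees a fresh random subset of the reference images.
        ref_idx = rng.sample(range(num_ref), min(ref_per_batch, num_ref))
        batches.append((ref_idx, gen_idx))
    return batches

def generate(latents, references, denoise_fn, num_steps,
             ref_per_batch, gen_per_batch, seed=0):
    """Run num_steps denoising steps with resampled conditioning per step.

    denoise_fn(batch, refs, step) is a hypothetical stand-in for one
    denoising update of the diffusion model; it must return one updated
    latent per input latent, in the same order.
    """
    rng = random.Random(seed)
    latents = list(latents)  # avoid mutating the caller's list
    for step in range(num_steps):
        batches = stochastic_io_step(len(references), len(latents),
                                     ref_per_batch, gen_per_batch, rng)
        for ref_idx, gen_idx in batches:
            refs = [references[i] for i in ref_idx]
            updated = denoise_fn([latents[j] for j in gen_idx], refs, step)
            for j, new in zip(gen_idx, updated):
                latents[j] = new
    return latents
```

Because the conditioning subset is redrawn at every step, the memory cost of a single model call depends on the batch sizes rather than on the total number of reference or generated images, which is what lets this kind of scheme scale to large sets.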