AvatarArtist: Open-Domain 4D Avatarization

Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, Qifeng Chen

TL;DR

AvatarArtist addresses open-domain 4D avatarization from a single portrait by uniting a diffusion-driven multi-domain data pipeline with domain-specific 4D GANs based on parametric triplanes. A latent diffusion transformer (DiT) models the 4D distribution conditioned on the input portrait, while a motion-aware cross-domain renderer preserves identity and accurately transfers motion across viewpoints. The approach enables scalable generation of image-4D pairs across 28 domains and achieves robust cross-domain reenactment on VFHQ, outperforming or matching state-of-the-art baselines in key metrics. This work provides a practical framework for open-domain, stylized, animatable 4D avatars suitable for AR/VR, games, and social-media applications; the authors state that the code, data, and models will be made publicly available.

Abstract

This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies.
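To make the triplane representation named in the abstract more concrete, the following is a minimal sketch of a generic triplane feature lookup: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are summed. This is the commonly used static triplane query, not the authors' parametric (mesh-aligned, dynamic plus static) variant. The names `bilinear` and `query_triplane`, the `[H][W][C]` plane layout, the `[-1, 1]` coordinate range, and the choice of summation as the aggregator are all assumptions made for illustration.

```python
from typing import List, Sequence

# A feature plane stored as nested lists with layout [H][W][C] (assumed layout).
Plane = List[List[List[float]]]


def bilinear(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly sample a plane at continuous coordinates u, v in [-1, 1]."""
    h, w = len(plane), len(plane[0])
    # Map [-1, 1] onto the pixel index range [0, size - 1] and clamp.
    x = min(max((u + 1.0) * 0.5 * (w - 1), 0.0), float(w - 1))
    y = min(max((v + 1.0) * 0.5 * (h - 1), 0.0), float(h - 1))
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - fx) * (1 - fy) * plane[y0][x0][k]
        + fx * (1 - fy) * plane[y0][x1][k]
        + (1 - fx) * fy * plane[y1][x0][k]
        + fx * fy * plane[y1][x1][k]
        for k in range(channels)
    ]


def query_triplane(planes: Sequence[Plane], point: Sequence[float]) -> List[float]:
    """Project a 3D point onto the XY, XZ and YZ planes and sum the features."""
    x, y, z = point
    f_xy = bilinear(planes[0], x, y)
    f_xz = bilinear(planes[1], x, z)
    f_yz = bilinear(planes[2], y, z)
    # Summation is one common aggregator; averaging or concatenation also occur.
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Toy example: three identical 2x2 planes with one channel each.
toy_plane: Plane = [[[0.0], [1.0]], [[2.0], [3.0]]]
toy_planes = [toy_plane, toy_plane, toy_plane]
center_feature = query_triplane(toy_planes, (0.0, 0.0, 0.0))
```

In a full renderer, the aggregated feature would typically be decoded into density and color by a small network and accumulated along camera rays; that part is omitted here.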

Paper Structure

This paper contains 28 sections, 1 equation, 13 figures, 5 tables.

Figures (13)

  • Figure 1: The overall training pipeline of our method. We first generate 2D images from different domains using a 2D diffusion model. These images are then used to train 4D GANs for each domain. Subsequently, the trained 4D GANs generate image-4D representation pairs across domains, which are used to train the DiT and the rendering model.
  • Figure 2: The pipeline of dataset generation. We use text prompts to transform images from the realistic domain to the target domain while ensuring pose and expression consistency with SDEdit [meng2021sdedit] and landmark-guided ControlNet [zhang2023adding]. This enables direct reuse of the original mesh, avoiding errors in non-realistic domain extraction. After domain transfer, we train 4D GANs to generate image-parametric triplane pairs, which serve as data for the next stage. The parametric triplane comprises dynamic and static components, with the dynamic region aligned to the mesh.
  • Figure 3: The pipeline of DiT. We first train a VAE to compress the parametric triplane into a latent space, and then train a DiT to denoise the noisy latent. We incorporate features from DINO [caron2021emerging] and CLIP [radford2021learning] into the DiT to guide the generation process.
  • Figure 4: The pipeline of the motion-aware cross-domain renderer. We use an encoder to extract features from the source image. These features are sent to a ViT, which predicts results guided by the generated parametric triplane and a motion embedding. Finally, a decoder processes the output of the ViT and fuses it with the rasterization results to produce the final output.
  • Figure 5: Qualitative comparison with SOTA methods. The leftmost column in the figure shows the input images, with the bottom-right corner representing the target image. The first row displays the results of self-reenactment, while the following three rows show the results of cross-reenactment. It can be observed that our method achieves superior performance in terms of expression and pose consistency, as well as identity preservation.
  • ...and 8 more figures
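
The Figure 3 caption describes training a DiT to denoise a noisy latent produced by a VAE. As a rough illustration of what "noisy latent" means in that setting, the sketch below implements the forward noising step of a standard DDPM-style diffusion process on a toy latent vector. The captions do not state which noise schedule or prediction target the authors use, so the linear beta schedule, the noise-prediction (epsilon) target, and the names `linear_betas`, `alpha_bars`, `add_noise`, and `mse` are assumptions for illustration only. The denoiser itself (the DiT with DINO and CLIP conditioning) is not modeled here.

```python
import math
import random
from typing import List, Tuple


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative product of (1 - beta_t), i.e. the signal fraction kept at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(z0: List[float], t: int, abar: List[float],
              rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return (z_t, eps) with z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal, noise = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    z_t = [signal * z + noise * e for z, e in zip(z0, eps)]
    return z_t, eps


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error, the usual loss between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


# Toy latent standing in for a VAE-compressed triplane (hypothetical values).
rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
z0 = [0.5, -1.0, 2.0, 0.0]
z_t, eps = add_noise(z0, 500, abar, rng)

# A trained denoiser would predict eps from (z_t, t, conditioning features);
# here the true noise is used to show that z0 is recoverable from a perfect prediction.
signal, noise = math.sqrt(abar[500]), math.sqrt(1.0 - abar[500])
z0_recovered = [(zt - noise * e) / signal for zt, e in zip(z_t, eps)]
```

During training, a timestep would be drawn at random for each sample and the loss `mse(predicted_eps, eps)` minimized; at inference, the process runs in reverse starting from pure noise.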