Towards Native Generative Model for 3D Head Avatar

Yiyu Zhuang, Yuxiao He, Jiawei Zhang, Yanwen Wang, Jiahe Zhu, Yao Yao, Siyu Zhu, Xun Cao, Hao Zhu

TL;DR

This work tackles the challenge of producing native, 360$^\circ$-renderable 3D head avatars from limited, high-quality 3D data by exploring three complementary representations (volume-based NeRF, hex-plane hybrid, and point-based Gaussian splats) and by disentangling appearance, shape, and expression in a semantically constrained parametric space $(\alpha,\beta,\epsilon)$. A new SynHead100 dataset and tailored single-image fitting, animation, and text-based editing pipelines enable random 3D head generation, full-view rendering, and editable motion while preserving identity. Across extensive experiments, the authors demonstrate state-of-the-art 3D geometry accuracy and rendering quality, with the point-based approach offering the best fidelity and efficiency and the hybrid approach providing robust animation control. The work advances practical 360$^\circ$ head synthesis from limited data, with potential impact on metaverse, film, and digital avatar applications, while acknowledging data costs and the need to disentangle lighting and material for further realism.

Abstract

Creating 3D head avatars is a significant yet challenging task for many application scenarios. Previous studies have set out to learn 3D human head generative models using massive 2D image data. Although these models generalize well across human appearance, the resulting models are not 360$^\circ$-renderable, and the predicted 3D geometry is unreliable. Therefore, such results cannot be used in VR, game modeling, and other scenarios that require 360$^\circ$-renderable 3D head models. An intuitive idea is that 3D head models, limited in number but high in 3D accuracy, are more reliable training data for a high-quality 3D generative model. In this vein, we delve into how to learn a native generative model for 360$^\circ$ full heads from a limited 3D head dataset. Specifically, three major problems are studied: 1) how to effectively utilize various representations for generating the 360$^\circ$-renderable human head; 2) how to disentangle the appearance, shape, and motion of human faces to generate a 3D head model that can be edited by appearance and driven by motion; and 3) how to extend the generalization capability of the generative model to support downstream tasks. Comprehensive experiments are conducted to verify the effectiveness of the proposed model. We hope the proposed models and artist-designed dataset can inspire future research on learning native generative 3D head models from limited 3D datasets.

Paper Structure (39 sections, 18 equations, 18 figures, 5 tables)

This paper contains 39 sections, 18 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Visualization of 3D head datasets.
  • Figure 2: Illustration of volume-based, hybrid-based, and point-based 3D head representations. (a) The volume-based representation employs an MLP to represent the radiance field, resulting in a slower rendering process when generating images through volume rendering. (b) The hybrid-based representation integrates an explicit parametric mesh and hex-plane structures with a compact MLP, facilitating enhanced control over expressions and increased detail. (c) The point-based representation retains the explicit parametric mesh while incorporating more efficient UV-mapped Gaussian attributes for avatar representation, improving quality and efficiency.
  • Figure 3: The pipelines of volume-based, hybrid-based, and point-based 3D head representations. (a) The volume-based model is built by taking the parameters $(\alpha, \beta, \epsilon)$ as conditions of the NeRF function. Volume rendering is then used to transform the radiance field into images. (b) The hybrid-based model introduces a neural texture to combine explicit parametric 3D models with an implicit neural radiance field. Furthermore, hex-planes are used to model the head and hair separately. This way, the parameters $(\alpha, \beta, \epsilon)$ are disentangled at the model level. (c) The point-based model inherits the idea of disentangling the parameters $(\alpha, \beta, \epsilon)$ at the model level. The key difference is that it uses attribute maps instead of a neural texture, together with a point-based rendering process.
  • Figure 4: Visualization of our fitting pipeline. (a) provides an overview of our fitting pipeline, which iteratively optimizes values in the parameter space to fit an in-the-wild image. After optimization, we obtain a 3D head model in our model space, enabling rigging and further editing applications. (b) illustrates the pre-processing step that normalizes the input image, bringing it closer to the synthetic dataset distribution. (c) presents our approach to fitting the full head, which considers the hair and face regions separately. We use GPT-4o to select the closest hairstyle in our SynHead100 dataset and combine it with a randomly initialized appearance code, and we initialize the face shape with Poisson blending. This creates a new reference image for full-head optimization.
  • Figure 5: Comparison of fitting results on in-the-wild images. We compare our method with previous parametric or generative 3D head models in single-image fitting. For a comprehensive comparison, both original models and re-trained models are compared.
  • ...and 13 more figures
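
The caption of Figure 3(a) describes a NeRF function conditioned on $(\alpha, \beta, \epsilon)$ whose radiance field is turned into images by volume rendering. The following is a minimal sketch of that idea, not the authors' implementation. It uses the standard NeRF quadrature for compositing samples along one ray. The names `toy_conditioned_field` and `composite_ray`, the code dimensions, and the single random linear layer standing in for the MLP are all assumptions made for illustration.

```python
import numpy as np


def toy_conditioned_field(points, alpha, beta, eps, weights):
    """Hypothetical stand-in for a conditioned NeRF MLP.

    Each 3D point is concatenated with the codes (alpha, beta, eps) and
    mapped by one linear layer to a density and an RGB colour.
    """
    cond = np.concatenate([alpha, beta, eps])
    inputs = np.concatenate([points, np.tile(cond, (len(points), 1))], axis=1)
    out = inputs @ weights
    sigma = np.log1p(np.exp(out[:, 0]))      # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[:, 1:]))  # sigmoid keeps colour in (0, 1)
    return sigma, rgb


def composite_ray(sigmas, colors, deltas):
    """Standard volume-rendering quadrature along a single ray."""
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # Opacity of each sample interval.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    sample_weights = trans * alphas
    pixel = (sample_weights[:, None] * colors).sum(axis=0)
    return pixel, sample_weights


# Usage: render one pixel for randomly drawn codes (illustrative sizes).
rng = np.random.default_rng(0)
points = np.linspace([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], 8)  # samples on a ray
alpha, beta, eps = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
layer = rng.normal(size=(3 + 12, 4))                        # 1 density + 3 RGB
sigma, rgb = toy_conditioned_field(points, alpha, beta, eps, layer)
pixel, sample_weights = composite_ray(sigma, rgb, np.full(8, 1.0 / 7.0))
```

Because every pixel needs many field evaluations along its ray, this kind of rendering is comparatively slow, which is consistent with the slower rendering that the Figure 2(a) caption attributes to the volume-based representation.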