HeadArtist: Text-conditioned 3D Head Generation with Self Score Distillation

Hongyu Liu, Xuan Wang, Ziyu Wan, Yujun Shen, Yibing Song, Jing Liao, Qifeng Chen

TL;DR

HeadArtist introduces Self Score Distillation (SSD), a landmark-guided, ControlNet-based framework for text-conditioned 3D head generation that optimizes both the geometry and the texture of a FLAME-initialized DMTet representation. By taking two aligned score predictions from the same ControlNet, conditioned on text and facial landmarks, SSD sidesteps common SDS/VSD issues such as over-saturation and Janus artifacts while embedding strong facial priors. Geometry and texture are generated sequentially and can be edited via natural language prompts, with negative prompts further improving texture realism. In qualitative and quantitative evaluations, HeadArtist achieves higher fidelity than the compared state-of-the-art methods and supports editing of both geometry and appearance.

Abstract

This work presents HeadArtist for 3D head generation from text descriptions. With a landmark-guided ControlNet serving as the generative prior, we devise an efficient pipeline that optimizes a parameterized 3D head model under the supervision of the prior distillation itself. We call this process self score distillation (SSD). In detail, given a sampled camera pose, we first render an image and its corresponding landmarks from the head model, and add a certain level of noise to the image. The noisy image, landmarks, and text condition are then fed into the frozen ControlNet twice for noise prediction. Two different classifier-free guidance (CFG) weights are applied during these two predictions, and the prediction difference offers a direction on how the rendered image can better match the text of interest. Experimental results suggest that our approach delivers high-quality 3D head sculptures with adequate geometry and photorealistic appearance, significantly outperforming state-of-the-art methods. We also show that the same pipeline readily supports editing the generated heads, including both geometry deformation and appearance change.
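The two-prediction step described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_noise_predictor` is a stand-in for the frozen ControlNet (it ignores landmarks and timestep), the noising formula is the standard DDPM forward process, both predictions are taken on the same noisy input, and the CFG weights 7.5 and 1.0 are arbitrary example values. The paper's actual formulation, weighting, and sign convention may differ.

```python
import math
import random


def add_noise(image, noise, alpha_bar):
    """Standard DDPM forward noising (assumed here, not taken from the paper)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(image, noise)]


def toy_noise_predictor(noisy_image, text_shift):
    """Hypothetical stand-in for the frozen ControlNet.

    text_shift = 0.0 plays the role of the unconditional branch;
    a non-zero value plays the role of the text-conditioned branch.
    """
    return [0.5 * x + text_shift for x in noisy_image]


def cfg_prediction(noisy_image, text_shift, cfg_weight):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_uncond = toy_noise_predictor(noisy_image, 0.0)
    eps_cond = toy_noise_predictor(noisy_image, text_shift)
    return [u + cfg_weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def prediction_difference(rendered, noise, alpha_bar, text_shift, w_first, w_second):
    """Query the same predictor twice with different CFG weights and subtract."""
    x_t = add_noise(rendered, noise, alpha_bar)
    first = cfg_prediction(x_t, text_shift, w_first)
    second = cfg_prediction(x_t, text_shift, w_second)
    return [p - q for p, q in zip(first, second)]


random.seed(0)
rendered = [0.1, 0.4, 0.7, 0.9]  # flattened toy "render"
noise = [random.gauss(0.0, 1.0) for _ in rendered]
direction = prediction_difference(rendered, noise, 0.5, 0.2, 7.5, 1.0)
```

Under these assumptions the difference reduces to `(w_first - w_second) * (eps_cond - eps_uncond)`, that is, a step along the direction in which the text condition changes the noise prediction. In an optimization loop this difference would be back-propagated through the renderer to the head model's parameters.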

Paper Structure (29 sections, 6 equations, 13 figures, 2 tables, 1 algorithm)

Figures (13)

  • Figure 1: We generate geometry and texture in two steps. For the geometry step, we employ a DMTet mesh [shen2021deep] initialized by FLAME [FLAME:SiggraphAsia2017] as the representation of the 3D human head's geometry. For the texture step, we fix the generated geometry and construct a texture space based on DMTet, similar to that in Magic3D [lin2023magic3d]. In both steps, we project mesh keypoints under the sampled camera to obtain landmarks. We then render a normal map (geometry step) or a textured image (texture step) and train the model with our self score distillation; $x_t$ denotes the noised normal map and the noised textured render, respectively.
  • Figure 2: Qualitative comparison with SOTA methods. Text Prompts: (A) a DSLR portrait of Lionel Messi. (B) a DSLR portrait of a female soldier, wearing a helmet. (C) a DSLR portrait of a boy with facial painting. Overall, our method demonstrates superior fidelity in terms of both geometry and textures.
  • Figure 3: The editing results of our HeadArtist. Our approach enables the manipulation of geometry and textures, thereby aligning the resulting content with the corresponding human-language descriptions.
  • Figure 4: Ablation study. Text Prompts: 1) a DSLR portrait of a young man with dreadlocks; 2) a DSLR portrait of Obama with a baseball cap. Compared with (a), (b), and (c), our full configuration in (e) avoids multi-face Janus problems (e.g., multiple brims on Obama's cap). Meanwhile, (e) generates complex geometry that better matches the text (i.e., (a)$\sim$(d) cannot generate the dreadlocks). In addition, (e) predicts more natural, higher-fidelity colors and fits the texture to the geometry better than (d). (f) shows our results without negative prompts.
  • Figure 5: Failure cases of our method. Our method cannot handle complex Japanese animation characters well.
  • ...and 8 more figures
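
The Figure 1 caption mentions projecting mesh keypoints under a camera to obtain landmarks. A minimal sketch of that idea follows, assuming a simple pinhole camera with keypoints already expressed in camera coordinates. The function name `project_keypoints` and the parameters `focal`, `cx`, and `cy` are hypothetical and not taken from the paper, whose camera model may differ.

```python
def project_keypoints(keypoints_cam, focal, cx, cy):
    """Project 3D keypoints (camera coordinates) to 2D pixel landmarks.

    Assumes a pinhole camera: u = focal * X / Z + cx, v = focal * Y / Z + cy.
    Points at or behind the camera plane (Z <= 0) are skipped.
    """
    landmarks = []
    for x, y, z in keypoints_cam:
        if z <= 0:
            continue  # not visible under this toy camera model
        landmarks.append((focal * x / z + cx, focal * y / z + cy))
    return landmarks


# Three toy keypoints in front of the camera and one behind it.
toy_keypoints = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, -1.0, 4.0), (0.0, 0.0, -1.0)]
toy_landmarks = project_keypoints(toy_keypoints, focal=100.0, cx=64.0, cy=64.0)
```

The resulting 2D points could then be rasterized into a landmark map that serves as the ControlNet condition alongside the text prompt.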