Rodin: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion

Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, Baining Guo

TL;DR

The paper tackles the challenge of generating highly detailed 3D avatars with diffusion models by introducing Rodin, a roll-out diffusion network that represents a neural radiance field as tri-plane feature maps rolled out into a single 2D feature plane. It combines 3D-aware convolution, latent conditioning via a shared encoder and CLIP, and hierarchical diffusion upsampling to produce coherent, editable 3D avatars at high resolution. In experiments on a large synthetic dataset, Rodin shows better visual fidelity and editing capabilities than prior 3D-aware methods, including portrait inversion and text-guided avatar manipulation. The approach offers a scalable pathway to high-quality 3D content, with potential extensions to general 3D scenes and faster sampling.

Abstract

This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings the much-needed computational efficiency while preserving the integrity of diffusion in 3D by using 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair like beards. We also demonstrate 3D avatar generation from image or text as well as text-guided editability.
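
The roll-out step described in the abstract can be illustrated with a small NumPy sketch. It assumes three axis-aligned feature planes of identical shape that are concatenated side by side along the width. The function names `roll_out` and `roll_in`, the plane ordering, and the channel count of 32 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def roll_out(plane_xy, plane_yz, plane_xz):
    """Concatenate three (C, H, W) feature planes into one (C, H, 3*W) plane."""
    return np.concatenate([plane_xy, plane_yz, plane_xz], axis=-1)

def roll_in(rolled):
    """Split a (C, H, 3*W) rolled-out plane back into three (C, H, W) planes."""
    return np.split(rolled, 3, axis=-1)

rng = np.random.default_rng(0)
C, H, W = 32, 64, 64  # C = 32 is an assumed channel count for illustration
planes = [rng.standard_normal((C, H, W)) for _ in range(3)]

rolled = roll_out(*planes)   # a single 2D feature plane a 2D network can process
restored = roll_in(rolled)   # the roll-out is lossless and trivially invertible
print(rolled.shape)
```

Because the roll-out is a plain concatenation, a 2D diffusion backbone can operate on the combined plane directly, and the three planes can be recovered afterwards without loss.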

Paper Structure (28 sections, 7 equations, 21 figures, 4 tables)

This paper contains 28 sections, 7 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: An overview of our Rodin model. We derive the latent $\bm{z}$ via the mapping from image, text, or random noise, which is used to control the base diffusion model to generate $64\times 64$ tri-planes. We train another diffusion model to upsample this coarse result to $256\times 256$ tri-planes that are used to render final multi-view images with volumetric rendering and convolutional refinement. The operators used in diffusion models are designed to be 3D-aware.
  • Figure 2: While $256\times 256$ tri-planes give good renderings (a), the $64\times 64$ variant gives a much worse result (b). Hence, we introduce random scaling during fitting so as to obtain a robust representation that can be effectively rendered in continuous scales (c).
  • Figure 3: We propose two mechanisms to ensure coherent tri-plane generation. Our 3D-aware convolution considers the 3D relationship in (a) and correlates the associated elements from separate feature planes as shown in (b). In (b), we also visualize the usage of a shared latent code to orchestrate the feature generation.
  • Figure 4: Unconditional generation samples by our Rodin model. We visualize the mesh extracted from the generated density field.
  • Figure 5: Latent interpolation results for generated avatars.
  • ...and 16 more figures
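
The Figure 3 caption says the 3D-aware convolution correlates associated elements from separate feature planes. One way such a correlation could be realized is sketched below: a pixel on the xy plane corresponds to a 3D line parallel to the z axis, which projects onto one row of each of the other two planes, so those rows are pooled and attached to the pixel as extra channels. The axis layouts, the use of mean pooling, the channel concatenation, and the name `aggregate_for_xy` are all assumptions made for illustration; the paper's actual operator may differ.

```python
import numpy as np

def aggregate_for_xy(f_xy, f_yz, f_xz):
    """Augment the xy plane with pooled context from the other two planes.

    Assumed layouts (hypothetical): f_xy[c, x, y], f_yz[c, y, z], f_xz[c, x, z],
    each of shape (C, N, N).
    """
    # Pixel (x, y) on the xy plane lies on a 3D line parallel to z.
    # That line projects to row y of the yz plane and row x of the xz plane,
    # so pooling over z summarizes the associated elements.
    ctx_from_yz = f_yz.mean(axis=2)  # (C, N), indexed by y
    ctx_from_xz = f_xz.mean(axis=2)  # (C, N), indexed by x
    # Broadcast each pooled vector over the axis it does not depend on.
    ctx_y = np.broadcast_to(ctx_from_yz[:, None, :], f_xy.shape)  # constant along x
    ctx_x = np.broadcast_to(ctx_from_xz[:, :, None], f_xy.shape)  # constant along y
    # Stack along channels; an ordinary 2D convolution could then mix them.
    return np.concatenate([f_xy, ctx_y, ctx_x], axis=0)  # (3C, N, N)

rng = np.random.default_rng(0)
C, N = 2, 4
f_xy, f_yz, f_xz = (rng.standard_normal((C, N, N)) for _ in range(3))
out = aggregate_for_xy(f_xy, f_yz, f_xz)
print(out.shape)
```

The same construction would be repeated for the yz and xz planes with the axes permuted, so that every plane sees context from the other two before a standard 2D convolution is applied.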