RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, Baining Guo

TL;DR

RodinHD generates high-fidelity 3D avatars from single portraits. It addresses catastrophic forgetting during triplane fitting with a task replay strategy and an identity-aware weight consolidation (IWC) regularizer. It pairs this fitting stage with a cascaded diffusion model conditioned on multi-scale, VAE-based portrait features injected via cross-attention, and with a noise schedule tuned for high-resolution triplanes. On a 46K-avatar dataset, RodinHD delivers sharper details, improved hair and clothing textures, and stronger cross-view consistency than prior methods, while generalizing to in-the-wild portrait inputs and enabling unconditional generation without a 2D refiner. The approach offers a framework for high-detail 3D diffusion from 2D cues, with potential extensions to other 3D tasks.

Abstract

We present RodinHD, which can generate high-fidelity 3D avatars from a portrait image. Existing methods fail to capture intricate details such as hairstyles, which we tackle in this paper. We first identify an overlooked problem of catastrophic forgetting that arises when fitting triplanes sequentially on many avatars, caused by the MLP decoder sharing scheme. To overcome this issue, we propose a novel data scheduling strategy and a weight consolidation regularization term, which improve the decoder's capability of rendering sharper details. Additionally, we optimize the guiding effect of the portrait image by computing a finer-grained hierarchical representation that captures rich 2D texture cues and injecting it into the 3D diffusion model at multiple layers via cross-attention. When trained on 46K avatars with a noise schedule optimized for triplanes, the resulting model can generate 3D avatars with notably better details than previous methods and can generalize to in-the-wild portrait input.
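
The two fitting-stage ideas named in the abstract, a data schedule that revisits avatars and a weight consolidation term, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `replay_schedule` and `consolidation_penalty` are hypothetical, the schedule simply reshuffles all avatars every round, and the penalty is the generic quadratic (EWC-style) form, which may differ from the identity-aware variant used in RodinHD.

```python
import random


def replay_schedule(num_avatars, num_rounds, seed=0):
    """Yield avatar indices so that every avatar is revisited once per round.

    Assumption: a plain reshuffled replay, used here only to contrast with
    fitting each avatar once, in sequence, and never seeing it again.
    """
    rng = random.Random(seed)
    order = list(range(num_avatars))
    for _ in range(num_rounds):
        rng.shuffle(order)
        yield from order


def consolidation_penalty(weights, anchor_weights, importance, strength=1.0):
    """Generic quadratic weight-consolidation penalty (EWC-style).

    Computes 0.5 * strength * sum_i importance_i * (w_i - anchor_i)^2, which
    discourages the shared decoder weights from drifting away from values
    that mattered for previously fitted avatars.
    """
    return 0.5 * strength * sum(
        f * (w - a) ** 2
        for w, a, f in zip(weights, anchor_weights, importance)
    )


if __name__ == "__main__":
    visits = list(replay_schedule(num_avatars=4, num_rounds=3))
    print("visit order:", visits)
    print("penalty:", consolidation_penalty([1.0, 3.0], [0.0, 1.0], [2.0, 0.5]))
```

In a real training loop the penalty would be added to the rendering loss of the shared decoder, with `importance` estimated from earlier avatars; how RodinHD estimates it is not stated in this section.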

Paper Structure

This paper contains 18 sections, 7 equations, 21 figures, 9 tables, 1 algorithm.

Figures (21)

  • Figure 1: RodinHD generates detailed 3D avatars from single portrait images (dashed boxes) without compromising cross-view consistency (first row). It also supports text-conditioned (second row left) or unconditional (second row right) generation.
  • Figure 2: Catastrophic forgetting. As training proceeds, the decoder gradually forgets the knowledge learned on the earlier avatars 1 and 4 and becomes overly adapted to avatar 9.
  • Figure 3: Overview of our method.
  • Figure 4: Frequency difference between two sources. Left: Triplanes have more high-frequency components than images. Right: Triplanes learned with our proposed IWC have fewer high-frequency components.
  • Figure 5: Rendered images from triplanes with 8 and 32 channels, respectively. Both triplanes are corrupted with the same noise level ($\text{logSNR}(t) = 0.57$). The 32-channel triplane has more redundancy, so it is less degraded.
  • ...and 16 more figures