RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, Baining Guo
TL;DR
RodinHD generates high-fidelity 3D avatars from a single portrait image. It first addresses catastrophic forgetting during triplane fitting with a task replay strategy and an identity-aware weight consolidation (IWC) regularizer. It then pairs this fitting stage with a cascaded diffusion model conditioned on multi-scale, VAE-based portrait features injected via cross-attention, and uses a noise schedule tuned for high-resolution triplanes. Trained on a 46K-avatar dataset, RodinHD delivers sharper details, improved hair and clothing textures, and stronger cross-view consistency than prior methods, while generalizing to in-the-wild portrait inputs and enabling unconditional generation without a 2D refiner. The approach supports scalable, identity-preserving 3D avatar synthesis and offers a framework for high-detail 3D diffusion from 2D cues, with potential extensions to other 3D tasks.
Abstract
We present RodinHD, which can generate high-fidelity 3D avatars from a portrait image. Existing methods fail to capture intricate details such as hairstyles, which we tackle in this paper. We first identify an overlooked problem of catastrophic forgetting that arises when fitting triplanes sequentially on many avatars, caused by the MLP decoder sharing scheme. To overcome this issue, we propose a novel data scheduling strategy and a weight consolidation regularization term, which together improve the decoder's capability of rendering sharper details. Additionally, we optimize the guiding effect of the portrait image by computing a finer-grained hierarchical representation that captures rich 2D texture cues and injecting it into the 3D diffusion model at multiple layers via cross-attention. When trained on 46K avatars with a noise schedule optimized for triplanes, the resulting model can generate 3D avatars with notably better details than previous methods and can generalize to in-the-wild portrait input.
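To make the idea of a weight consolidation regularization term concrete, the following is a minimal sketch of a generic quadratic consolidation penalty in the style of elastic weight consolidation. It is not the paper's exact formulation: the function names, the per-weight `importance` values, and the `strength` coefficient are all assumptions introduced for illustration. The sketch only shows how such a penalty could discourage shared decoder weights from drifting away from values that worked for previously fitted avatars.

```python
from typing import Sequence


def consolidation_penalty(
    weights: Sequence[float],
    anchor_weights: Sequence[float],
    importance: Sequence[float],
    strength: float = 1.0,
) -> float:
    """Generic quadratic weight consolidation penalty (illustrative only).

    Computes strength / 2 * sum_i importance[i] * (weights[i] - anchor_weights[i])**2.
    `anchor_weights` are hypothetical reference values of the shared decoder
    weights, and `importance` is a hypothetical per-weight importance score.
    """
    if not (len(weights) == len(anchor_weights) == len(importance)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for w, w_ref, imp in zip(weights, anchor_weights, importance):
        # Weights with higher importance are penalized more for drifting.
        total += imp * (w - w_ref) ** 2
    return 0.5 * strength * total


def regularized_loss(
    fit_loss: float,
    weights: Sequence[float],
    anchor_weights: Sequence[float],
    importance: Sequence[float],
    strength: float = 1.0,
) -> float:
    """Adds the consolidation penalty to a fitting loss (assumed to be a scalar)."""
    return fit_loss + consolidation_penalty(
        weights, anchor_weights, importance, strength
    )
```

In this sketch the penalty is zero when the weights match their anchors and grows quadratically with drift, scaled by each weight's importance. How importance is estimated, and how it is made identity-aware, is left open here.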
