HD-Fusion: Detailed Text-to-3D Generation Leveraging Multiple Noise Estimation
Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, Errui Ding
TL;DR
This work tackles the bottleneck of generating high-detail 3D content from text when guided by 2D diffusion priors. It introduces a memory-efficient, tile-based multiple noise estimation method that computes the SDS loss on high-resolution renderings, and combines it with a two-stage text-to-3D pipeline that uses 2D diffusion priors and optional ControlNet guidance. The coarse neural-field representation from Stage 1 is refined in Stage 2 with DMTet-based geometry and a color network, enabling detailed geometry and appearance while maintaining training efficiency. Experiments show improvements in detail and visual quality over baselines and state-of-the-art methods, and ablations highlight the benefits of tile overlap and pose guidance for 3D content generation.
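The tile-based noise estimation summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the tile size, stride, uniform averaging of overlapping tiles, and the names `tile_starts`, `fused_noise_estimate`, `sds_residual`, and `predict_noise` are all assumptions made for illustration, and `predict_noise` stands in for a pretrained diffusion model's noise predictor.

```python
import numpy as np


def tile_starts(size, tile, stride):
    """Start offsets of tiles of length `tile` that cover [0, size).

    The last tile is placed flush with the edge so no pixels are missed.
    """
    if tile >= size:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)
    return starts


def fused_noise_estimate(latent, predict_noise, tile=64, stride=48):
    """Estimate noise on a (C, H, W) array one tile at a time.

    `predict_noise` is a hypothetical callable mapping a (C, h, w) crop to a
    noise estimate of the same shape. Overlapping predictions are averaged
    uniformly (an assumption; other weightings are possible).
    """
    _, h, w = latent.shape
    total = np.zeros(latent.shape, dtype=np.float64)
    count = np.zeros((1, h, w), dtype=np.float64)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            crop = latent[:, y:y + tile, x:x + tile]
            total[:, y:y + tile, x:x + tile] += predict_noise(crop)
            count[:, y:y + tile, x:x + tile] += 1.0
    return total / count


def sds_residual(noisy_latent, true_noise, predict_noise, weight=1.0, **tiling):
    """Weighted residual (fused estimate minus injected noise) used by SDS."""
    fused = fused_noise_estimate(noisy_latent, predict_noise, **tiling)
    return weight * (fused - true_noise)
```

Because each call to `predict_noise` sees only one tile, the peak memory of the noise predictor depends on the tile size rather than on the full rendering resolution, which is the property the summary attributes to the tile-based approach.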
Abstract
In this paper, we study text-to-3D content generation that leverages 2D diffusion priors to enhance the quality and detail of the generated 3D models. Recent progress in text-to-3D (Magic3D) has shown that high-resolution (e.g., 512 × 512) renderings can lead to high-quality 3D models when latent diffusion priors are used. To enable rendering at even higher resolutions, which has the potential to further improve the quality and detail of the models, we propose a novel approach that combines multiple noise estimation processes with a pretrained 2D diffusion prior. Unlike the study by Bar-Tal et al., which binds multiple denoised results to generate images from text, our approach integrates the computation of score distillation losses such as the SDS loss and the VSD loss, which are essential techniques for 3D content generation with 2D diffusion priors. We evaluate the proposed approach experimentally. The results show that it generates higher-quality details than the baselines.
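For reference, the SDS gradient in its commonly used form (as introduced in DreamFusion) is written out here, together with one plausible way a fused multi-tile noise estimate could enter it. The fused estimator in the second equation is an assumption for illustration, not the formulation given in this paper: P_i denotes a hypothetical operator that crops tile i from the noisy rendering x_t, and overlapping tiles are averaged uniformly.

```latex
% Standard SDS gradient: x = g(\theta) is the rendering, y the text prompt,
% \epsilon the injected noise, and w(t) a timestep-dependent weight.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}(x_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]

% Assumed fused estimate: per-tile predictions of the pretrained model
% \epsilon_\phi are pasted back and averaged where tiles overlap
% (the division is element-wise).
\hat{\epsilon}(x_t; y, t)
  = \frac{\sum_i P_i^{\top}\,\epsilon_\phi(P_i x_t; y, t)}
         {\sum_i P_i^{\top}\mathbf{1}}
```

With a single tile covering the whole rendering, the second equation reduces to the ordinary noise prediction, so the first equation becomes the usual single-pass SDS gradient.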
