HD-Fusion: Detailed Text-to-3D Generation Leveraging Multiple Noise Estimation

Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, Errui Ding

TL;DR

This work tackles the bottleneck of generating high-detail 3D content from text when guided by 2D diffusion priors. It introduces a memory-efficient, tile-based multiple noise estimation method to compute the SDS loss on high-resolution renderings and combines it with a two-stage text-to-3D pipeline that uses 2D diffusion priors and optional ControlNet guidance. The coarse Stage 1 neural-field representation is refined in Stage 2 with a DMTet-based geometry and a color network, enabling detailed geometry and appearance while maintaining training efficiency. Experiments show improvements in detail and visual quality over baselines and state-of-the-art methods, with ablations highlighting the benefits of tile overlap and pose guidance for 3D content generation.

Abstract

In this paper, we study text-to-3D content generation leveraging 2D diffusion priors to enhance the quality and detail of the generated 3D models. Recent progress (Magic3D) in text-to-3D has shown that employing high-resolution (e.g., 512 x 512) renderings can lead to high-quality 3D models using latent diffusion priors. To enable rendering at even higher resolutions, which has the potential to further improve the quality and detail of the models, we propose a novel approach that combines multiple noise estimation processes with a pretrained 2D diffusion prior. Distinct from the study of Bar-Tal et al., which binds multiple denoised results to generate images from text, our approach integrates the computation of score distillation losses such as the SDS loss and the VSD loss, which are essential techniques for 3D content generation with 2D diffusion priors. We experimentally evaluated the proposed approach. The results show that it generates higher-quality details than the baselines.
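
For orientation, the SDS gradient mentioned in the abstract is usually written as below, following the DreamFusion formulation. The notation is an assumption and is not taken from this page: $\theta$ denotes the parameters of the 3D representation, $z$ the latent of a rendered image, $z_t$ its noised version at step $t$, $\varepsilon$ the sampled noise, $\hat{\varepsilon}_\phi$ the pretrained noise estimator conditioned on the prompt $y$, and $w(t)$ a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\varepsilon}\!\left[
      w(t)\,\bigl(\hat{\varepsilon}_\phi(z_t;\, y,\, t) - \varepsilon\bigr)\,
      \frac{\partial z}{\partial \theta}
    \right]
```

Under a multiple noise estimation scheme, the single evaluation of $\hat{\varepsilon}_\phi$ on the full $z_t$ would presumably be replaced by an estimate consolidated from several evaluations on parts of $z_t$, which is what keeps memory bounded as the rendering resolution grows.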

Paper Structure (11 sections, 6 equations, 8 figures, 1 algorithm)


Figures (8)

  • Figure 1: The proposed multiple noise estimation approach yields high-quality 3D models with enhanced details. The prompts for the three contents are 1) "A model of a house in Tudor style.", 2) "A delicious croissant." and 3) "a ripe strawberry."
  • Figure 2: The proposed multiple noise estimation is illustrated here. The "Latent" represents the latent representation of a rendered image. The "Noise $\varepsilon$" is the additive noise sampled at step $t$ of the diffusion process. The "Tiled noisy latent" is obtained by cropping overlapping patches from the "Noisy latent" with a sliding window. The "Controlled SD-UNet" is the Stable Diffusion model, optionally augmented with an instance of ControlNet. The "Estimated noises" are produced by consolidating all the estimated tiles of noise.
  • Figure 3: The coarse stage of the proposed approach.
  • Figure 4: This figure illustrates the fine-level generation stage of the proposed approach. It has two phases, denoted by P1 and P2. Geometry is learned in P1 and color in P2. The DMTet model is optimized in P1 and is fixed in P2. The proposed multiple noise estimation is only applied in P2.
  • Figure 5: Result comparison. "Ours" means the results generated with the proposed multiple noise estimation. "Baseline" means the results generated without applying the multiple noise estimation.
  • ...and 3 more figures
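
The tiling procedure described in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the caption only says the tile estimates are "consolidated", so averaging over overlapping regions is an assumption here, the tile size and stride are hypothetical values, and `predict_noise` is a hypothetical stand-in for the controlled SD-UNet.

```python
import numpy as np


def tile_starts(size, tile, stride):
    """Start offsets of a sliding window that fully covers [0, size)."""
    if tile >= size:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)  # extra tile so the border is covered
    return starts


def multi_noise_estimate(noisy_latent, predict_noise, tile=64, stride=48):
    """Estimate noise on a large latent from per-tile predictions.

    noisy_latent: array of shape (C, H, W).
    predict_noise: hypothetical callable mapping a (C, h, w) tile to a
        noise estimate of the same shape (stand-in for the SD-UNet).
    Overlapping predictions are averaged (an assumption, see lead-in).
    """
    _, h, w = noisy_latent.shape
    total = np.zeros(noisy_latent.shape, dtype=np.float64)
    count = np.zeros((1, h, w), dtype=np.float64)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            patch = noisy_latent[:, y:y + tile, x:x + tile]
            total[:, y:y + tile, x:x + tile] += predict_noise(patch)
            count[:, y:y + tile, x:x + tile] += 1.0
    return total / count


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((4, 128, 128))
    noise = rng.standard_normal((4, 128, 128))
    noisy = latent + noise  # toy forward process without a noise schedule
    # Toy predictor; a real one would call the diffusion model per tile.
    estimated = multi_noise_estimate(noisy, lambda p: 0.5 * p)
    print(estimated.shape)
```

Only one tile passes through the predictor at a time, so peak memory depends on the tile size rather than on the full rendering resolution. A stride smaller than the tile size gives the overlap that the ablations are reported to favor.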