Text-to-3D Generation by 2D Editing

Haoran Li, Yuli Tian, Yonghui Wang, Yong Liao, Lin Wang, Yuyang Wang, Peng Yuan Zhou

TL;DR

This work analyzes the bottlenecks of SDS-based text-to-3D generation, identifying single-step denoising as a key source of directional errors that lead to over-saturation, over-smoothing, and limited content. It then proposes GE3D, a multi-step 2D diffusion editing framework that aligns latents along both a noising trajectory and a text-guided denoising trajectory, using an $n$-step process and a dynamic balancing coefficient to distill information across multiple granularities into 3D Gaussians. Empirically, GE3D achieves photorealistic, diverse 3D outputs with faster convergence and improved quantitative metrics (e.g., CLIP similarity, FID, BRISQUE) compared with state-of-the-art baselines. The approach unifies 2D editing techniques with 3D generation, offering a practical pathway to higher-quality 3D content and opening avenues for integrating advanced 2D editing methods into 3D synthesis.

Abstract

Distilling 3D representations from pretrained 2D diffusion models is essential for 3D creative applications across gaming, film, and interior design. Current SDS-based methods are hindered by inefficient information distillation from diffusion models, which prevents the creation of photorealistic 3D content. In this paper, we first reevaluate the SDS approach by analyzing its fundamental nature as a basic image editing process, one that commonly results in over-saturation, over-smoothing, and a lack of rich content and diversity because of poor-quality single-step denoising. In light of this, we then propose a novel method called 3D Generation by Editing (GE3D). Each iteration of GE3D uses a 2D editing framework that combines a noising trajectory, which preserves the information of the input image, with a text-guided denoising trajectory. We optimize the process by aligning the latents across both trajectories. This approach fully exploits pretrained diffusion models to distill multi-granularity information through multiple denoising steps, resulting in photorealistic 3D outputs. Both theoretical and experimental results confirm the effectiveness of our approach, which not only advances 3D generation technology but also establishes a novel connection between 3D generation and 2D editing. This could inspire further research in the field. Code and demos are released at https://jahnsonblack.github.io/GE3D/.
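
To make the idea of aligning latents across a noising and a denoising trajectory concrete, the following is a minimal toy sketch, not the released GE3D implementation. It assumes deterministic DDIM-style steps and replaces the pretrained diffusion model with a hypothetical closed-form noise predictor `toy_eps` that pulls toward a fixed target latent. The names `toy_eps`, `ddim_step`, `alignment_loss`, `source_target`, and the fixed scalar `lam` (standing in for the dynamic balancing coefficient, whose form is not given here) are all assumptions made for illustration.

```python
import numpy as np

def toy_eps(x, ab, target):
    # Hypothetical noise predictor (assumption): exact for a "model" whose
    # only clean latent is `target`. A real system would call a diffusion U-Net.
    return (x - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)

def ddim_step(x, eps, ab_from, ab_to):
    # Deterministic DDIM move between two noise levels (either direction).
    x0_pred = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0_pred + np.sqrt(1.0 - ab_to) * eps

def alignment_loss(x0, source_target, text_target, alpha_bars, lam=1.0):
    # alpha_bars[k] is the cumulative signal level at step k (decreasing).
    n = len(alpha_bars) - 1

    # Noising trajectory x_0 -> x_n, conditioned on the source content.
    noised = [x0]
    for k in range(n):
        eps = toy_eps(noised[-1], alpha_bars[k], source_target)
        noised.append(ddim_step(noised[-1], eps, alpha_bars[k], alpha_bars[k + 1]))

    # Text-guided denoising trajectory, started from the same x_n.
    x_tilde = noised[-1]
    loss = 0.0
    for k in range(n, 0, -1):
        eps = toy_eps(x_tilde, alpha_bars[k], text_target)
        x_tilde = ddim_step(x_tilde, eps, alpha_bars[k], alpha_bars[k - 1])
        # Compare latents at the same noise level on both trajectories.
        loss += lam * float(np.mean((noised[k - 1] - x_tilde) ** 2))
    return loss / n

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))            # stand-in for a rendered-view latent
text_target = rng.normal(size=(4, 8, 8))   # stand-in for what the prompt favors
alpha_bars = np.linspace(0.98, 0.10, 6)    # n = 5 steps

# Same conditioning on both trajectories: denoising retraces noising.
loss_same = alignment_loss(x0, text_target, text_target, alpha_bars)
# Different conditioning: the trajectories diverge, giving a training signal.
loss_edit = alignment_loss(x0, x0, text_target, alpha_bars)
print(loss_same, loss_edit)
```

In this toy setting the loss vanishes when both trajectories share the same conditioning and becomes positive once the text-guided trajectory pulls toward different content. In a 3D pipeline the gradient of such a loss would flow back through the rendered view into the 3D representation; that step is omitted here.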

Paper Structure

This paper contains 26 sections, 17 equations, 15 figures, 1 table, 1 algorithm.

Figures (15)

  • Figure 1: Text-to-3D generation results of our framework, GE3D. GE3D replaces the inefficient single-step editing in SDS [poole2022dreamfusion] with a multi-step 2D diffusion editing approach. This change mitigates over-saturation and over-smoothing, enriches image content, and increases diversity, thereby achieving photorealistic generation quality. Please zoom in to see the details.
  • Figure 2: We simulate the 3D generation processes of SDS [poole2022dreamfusion] and GE3D by passing coarse 2D images through multiple iterations of single-step and multi-step editing, respectively. Here, $x_0$ is the input image, $\tilde{x}_0$ is the predicted image, and $x_t$ and $\tilde{x}_t$ are the latents along the noising and denoising trajectories, respectively.
  • Figure 3: Overview of GE3D. We integrate the 2D editing process into 3D generation. Unlike the single-step editing of SDS in DreamFusion [poole2022dreamfusion] and ISM in LucidDreamer [liang2023luciddreamer], we use multi-step editing with latent alignment to combine information at different granularities from the pretrained 2D diffusion model into the 3D representation, achieving high-quality generation.
  • Figure 4: Qualitative comparison with baselines in text-to-3D generation.
  • Figure 5: Convergence speed of different text-to-3D methods. We did not plot ProlificDreamer's results because of its excessive time consumption.
  • ...and 10 more figures