Learn to Optimize Denoising Scores for 3D Generation: A Unified and Improved Diffusion Prior on NeRF and 3D Gaussian Splatting

Xiaofeng Yang, Yiwen Chen, Cheng Chen, Chi Zhang, Yi Xu, Xulei Yang, Fayao Liu, Guosheng Lin

TL;DR

This work addresses the low quality of 3D generation driven by diffusion priors, which it attributes to a training-inference mismatch introduced by classifier-free guidance (CFG). It introduces Learn to Optimize Denoising Scores (LODS), a unified framework that jointly optimizes the 3D model parameters and the diffusion prior by adding learnable components (an unconditional embedding $\alpha$ or LoRA parameters $\psi$) to the SDS objective. The choice of learnable component yields configurations that trade performance against implementation complexity. Through its embedding-based and LoRA-based variants, LODS bridges the CFG gap, reduces the "floating" phenomenon observed with earlier priors such as VSD, and achieves state-of-the-art results on text-to-3D benchmarks across NeRF and 3D Gaussian Splatting backbones, with strong performance in image-to-3D and 2D generation/editing as well. The approach delivers higher-fidelity textures and colors, faster generation with Gaussian Splatting backbones, and a clearer understanding of score-distillation losses (SDS, DDS, VSD) in diffusion-prior optimization.
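
To make the role of the learnable unconditional embedding concrete, the block below writes out the widely used SDS gradient with classifier-free guidance, followed by one way the fixed null embedding could be swapped for a learnable $\alpha$. Only $\alpha$ appears in the summary above. The remaining symbols ($\theta$ for the 3D parameters, $\phi$ for the frozen denoiser, $y$ for the text condition, $s$ for the guidance scale, $w(t)$ for the timestep weight, $x_t$ for the noised render) are assumed notation, and the second equation is an illustrative reading rather than the paper's exact formulation.

```latex
% SDS gradient with classifier-free guidance (standard form; notation assumed)
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
      \bigl(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta} \right],
\qquad
\hat{\epsilon}_\phi(x_t; y, t)
  = \epsilon_\phi(x_t; \emptyset, t)
  + s\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \emptyset, t)\bigr)

% Illustrative embedding-based variant: the null embedding becomes a learnable \alpha
\hat{\epsilon}_{\phi,\alpha}(x_t; y, t)
  = \epsilon_\phi(x_t; \alpha, t)
  + s\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \alpha, t)\bigr)
```

Under this reading, the LoRA-based variant would instead add low-rank parameters $\psi$ to the denoiser itself while keeping the same guidance structure.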

Abstract

We propose a unified framework aimed at enhancing the diffusion priors for 3D generation tasks. Despite the critical importance of these tasks, existing methodologies often struggle to generate high-caliber results. We begin by examining the inherent limitations in previous diffusion priors. We identify a divergence between the diffusion priors and the training procedures of diffusion models that substantially impairs the quality of 3D generation. To address this issue, we propose a novel, unified framework that iteratively optimizes both the 3D model and the diffusion prior. Leveraging the different learnable parameters of the diffusion prior, our approach offers multiple configurations, affording various trade-offs between performance and implementation complexity. Notably, our experimental results demonstrate that our method markedly surpasses existing techniques, establishing a new state of the art in text-to-3D generation. Furthermore, our approach exhibits impressive performance on both NeRF and the newly introduced 3D Gaussian Splatting backbones. Additionally, our framework contributes to the understanding of recent score distillation methods, such as the VSD and DDS losses.
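
The alternating scheme described in the abstract (update the 3D model, then update the diffusion prior) can be sketched on a scalar toy problem. Everything in the block is a hypothetical stand-in: `toy_denoiser` replaces a real diffusion network, the "3D model" is a single number whose render is itself, the noise level is fixed at one, and the prior update (a plain denoising loss on the current render) is an assumption chosen for illustration, not the paper's algorithm.

```python
import random


def toy_denoiser(x_t, cond):
    """Hypothetical noise predictor: treats `cond` as its guess of the clean sample."""
    return x_t - cond


def cfg_noise(x_t, text_emb, null_emb, scale):
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    eps_uncond = toy_denoiser(x_t, null_emb)
    eps_cond = toy_denoiser(x_t, text_emb)
    return eps_uncond + scale * (eps_cond - eps_uncond)


def run_alternating(steps=200, text_emb=2.0, scale=7.5,
                    lr_theta=0.05, lr_alpha=0.05, seed=0):
    """Alternate between a model update and a prior (null-embedding) update."""
    rng = random.Random(seed)
    theta = 0.0  # stand-in for the 3D model parameters; render(theta) == theta
    alpha = 0.0  # learnable unconditional ("null") embedding (assumed scalar)
    for _ in range(steps):
        # Step 1: SDS-style update of the model; the denoiser Jacobian is skipped.
        eps = rng.gauss(0.0, 1.0)
        x_t = theta + eps  # simplified noising at unit noise level
        grad_theta = cfg_noise(x_t, text_emb, alpha, scale) - eps
        theta -= lr_theta * grad_theta

        # Step 2: update alpha by descending (toy_denoiser(x_t, alpha) - eps) ** 2.
        eps = rng.gauss(0.0, 1.0)
        x_t = theta + eps
        residual = toy_denoiser(x_t, alpha) - eps  # d(residual)/d(alpha) == -1
        alpha -= lr_alpha * (-2.0 * residual)
    return theta, alpha


if __name__ == "__main__":
    theta, alpha = run_alternating()
    print(f"theta={theta:.4f}, alpha={alpha:.4f}")
```

In this toy setting both values settle near `text_emb`: the learned embedding tracks the current render, so the unconditional term stops pulling against the guided one. A real pipeline would replace the scalars with a NeRF or Gaussian Splatting renderer and a pretrained denoiser.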

Paper Structure (29 sections, 10 equations, 14 figures, 2 tables, 3 algorithms)

Figures (14)

  • Figure 1: Examples of 3D generation results of LODS. Our proposed method is capable of generating 3D objects with exceptional fidelity, showcasing intricate details and remarkably accurate colors.
  • Figure 2: Method overview. Our proposed methods learn to optimize the denoising scores by optimizing either the null embedding or the additional low-rank parameters of LoRA.
  • Figure 3: Comparison of processing efficiency. Our proposed methods are more efficient than the VSD loss.
  • Figure 4: Comparison with other methods on text-to-3D generation with NeRF. Our method generates higher-quality 3D details than the SDS loss and avoids the "floating" problem of the VSD loss. Text prompts: "A DSLR photo of a hamburger", "A roast turkey on a platter", and "An intricately-carved wooden chess set".
  • Figure 5: Ablation of the classifier-free guidance (CFG) weight on text-to-3D generation with NeRF. At a low CFG weight, the generated results usually lack texture details. An excessively high CFG weight leads to a "floating" phenomenon similar to that observed with VSD. Text prompts: "A 3D model of an adorable cottage with a thatched roof", "Sydney opera house", and "A car made out of cheese".
  • ...and 9 more figures