Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching

Xingyu Miao, Haoran Duan, Varun Ojha, Jun Song, Tejal Shah, Yang Long, Rajiv Ranjan

TL;DR

Dreamer XL introduces Trajectory Score Matching (TSM) to address the pseudo-ground-truth inconsistency caused by errors that accumulate in the DDIM inversion used by Interval Score Matching (ISM). TSM runs two DDIM inversion trajectories from the same starting latent and minimizes $L_{\text{TSM}}(\theta)=\mathbb{E}_{t,c}[\omega(t) \| \epsilon_\phi(x_t,t,y) - \epsilon_\phi(x_\mu,\mu,\emptyset) \|^2]$ with $\mu = \gamma(t-s)+s$, which reduces error accumulation and contains ISM as a special case. The method uses Stable Diffusion XL (SDXL) for high-resolution (1024×1024) guidance of 3D Gaussian splatting and introduces a pixel-by-pixel gradient clipping strategy to stabilize gradients during SDXL optimization. Theorem 1 provides theoretical support by formalizing the reduced error of the dual-path trajectory, and empirical results show substantial improvements in visual quality and consistency over state-of-the-art baselines. Overall, Dreamer XL delivers detailed text-to-3D generation with fewer artifacts, with all reported results generated on a single A100 GPU.
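To make the objective above concrete, here is a minimal NumPy sketch of the intermediate timestep $\mu = \gamma(t-s)+s$ and the squared residual inside $L_{\text{TSM}}$. It takes the TL;DR formula at face value. The noise predictor `toy_eps`, the latent shapes, and the timestep values are hypothetical placeholders rather than the authors' implementation, and the latents are random instead of coming from DDIM inversion.

```python
import numpy as np

def intermediate_timestep(t, s, gamma):
    """Return mu = gamma * (t - s) + s, the timestep between s and t."""
    return gamma * (t - s) + s

def toy_eps(x, timestep, conditional):
    """Hypothetical stand-in for the noise predictor eps_phi.

    A real setup would query a pretrained diffusion model (SDXL in the paper);
    this toy function only lets the sketch run end to end.
    """
    shift = 0.1 if conditional else 0.0  # mimics the effect of the text prompt y
    return np.tanh(x) * (timestep / 1000.0) + shift

def tsm_residual_loss(x_t, t, x_mu, mu, weight=1.0):
    """weight * || eps(x_t, t, y) - eps(x_mu, mu, empty) ||^2 for one sample."""
    diff = toy_eps(x_t, t, True) - toy_eps(x_mu, mu, False)
    return weight * float(np.sum(diff ** 2))

# Placeholder latents; in TSM both would be reached from the same starting point.
rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 8, 8))
x_mu = rng.standard_normal((4, 8, 8))

t, s, gamma = 500, 200, 0.3  # assumed values, chosen only for illustration
mu = intermediate_timestep(t, s, gamma)
loss = tsm_residual_loss(x_t, t, x_mu, mu)
print(mu, loss)
```

Setting `gamma = 0` gives `mu == s`, so the residual compares timesteps $t$ and $s$ directly, which matches the summary's statement that ISM is a special case of TSM.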

Abstract

In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo-ground-truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM, which uses DDIM inversion to compute along a single path, our TSM method uses DDIM inversion to generate two paths from the same starting point. Because both paths share that starting point, TSM reduces the accumulated error compared to ISM, thus alleviating the pseudo-ground-truth inconsistency. TSM also enhances the stability and consistency of the model's generated paths during the distillation process. We demonstrate this experimentally and further show that ISM is a special case of TSM. Furthermore, to simplify the current multi-stage optimization process of high-resolution text-to-3D generation, we adopt Stable Diffusion XL for guidance. To address the abnormal replication and splitting caused by unstable gradients during 3D Gaussian splatting when using Stable Diffusion XL, we propose a pixel-by-pixel gradient clipping method. Extensive experiments show that our model significantly surpasses state-of-the-art models in visual quality and performance. Code: https://github.com/xingy038/Dreamer-XL.
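The abstract does not spell out the pixel-by-pixel gradient clipping rule. The NumPy sketch below shows one plausible reading, assumed here for illustration only: the gradient with respect to the rendered image is treated as one vector per pixel (across channels), and any pixel whose L2 norm exceeds a threshold is rescaled to that threshold. The function name, the `(C, H, W)` layout, and the threshold are hypothetical and may differ from the authors' code.

```python
import numpy as np

def clip_gradient_per_pixel(grad, max_norm, eps=1e-8):
    """Rescale each pixel's channel vector so its L2 norm is at most max_norm.

    grad: array of shape (C, H, W), a gradient w.r.t. a rendered image or latent.
    Pixels whose norm is already below max_norm are left unchanged.
    """
    norms = np.linalg.norm(grad, axis=0, keepdims=True)  # shape (1, H, W)
    scale = np.minimum(1.0, max_norm / (norms + eps))     # per-pixel shrink factor
    return grad * scale

# Toy gradient with one exploding pixel and one well-behaved pixel.
grad = np.zeros((4, 2, 2))
grad[:, 0, 0] = [3.0, 4.0, 0.0, 0.0]   # norm 5, will be shrunk
grad[:, 1, 1] = [0.3, 0.4, 0.0, 0.0]   # norm 0.5, stays as is

clipped = clip_gradient_per_pixel(grad, max_norm=1.0)
print(np.linalg.norm(clipped, axis=0))
```

Clipping per pixel rather than over the whole tensor keeps the direction of each pixel's gradient while limiting outliers, so a few extreme pixels do not dominate the update or force well-behaved pixels to be scaled down.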

Paper Structure

This paper contains 31 sections, 12 equations, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: Examples of text-to-3D content generated from scratch by our Dreamer XL, which is based on 3D Gaussian splatting guided by Stable Diffusion XL. Please zoom in for details.
  • Figure 2: ISM example from LucidDreamer (Liang et al., 2023). With the same initial value $x_0$ but different noise samples $\{\epsilon_1,\epsilon_2,\epsilon_3,\epsilon_4\}$, the generated results still show inconsistencies, owing to the error accumulation inherent in the DDIM inversion process. These inconsistencies can cause errors in some regions during optimization of the 3D model.
  • Figure 3: Comparison with state-of-the-art baseline methods in text-to-3D generation. Experimental results show that our method generates 3D content that is more consistent with the input text prompts and has finer details. All results of this work are generated on a single A100 GPU. Please zoom in to see more details.
  • Figure 4: Comparison of generation results with different Stable Diffusion models. Compared with ISM, our TSM yields clearer and more consistent local details. Please zoom in on the circled regions for more details.
  • Figure 5: Ablation on offset rate. $\gamma=0.3$ achieves optimal visual quality and ensures high consistency between the generated results and the original text.
  • ...and 6 more figures