Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation

Ziying Li, Xuequan Lu, Xinkui Zhao, Guanjie Cheng, Shuiguang Deng, Jianwei Yin

TL;DR

This work addresses the artifacts and reliability issues in SDS-based text-to-3D generation by reframing SDS as a special case of the Schrödinger Bridge and introducing Trajectory-Centric Distillation (TraCe). TraCe explicitly constructs a diffusion bridge from the current 3D render to a text-conditioned target and trains a LoRA-adapted diffusion model to learn the bridge's score dynamics, enabling robust optimization at lower Classifier-free Guidance (CFG) values. The approach yields higher-fidelity 3D assets with better semantic alignment and texture detail, and remains robust across CFG values, with competitive performance at moderate CFG. Overall, TraCe provides a principled, direct-trajectory optimization framework for text-to-3D generation with practical improvements in quality, efficiency, and stability.

Abstract

Recent advances in optimization-based text-to-3D generation rely heavily on distilling knowledge from pre-trained text-to-image diffusion models using techniques such as Score Distillation Sampling (SDS), which often introduce artifacts such as over-saturation and over-smoothing into the generated 3D assets. In this paper, we address this core problem by formulating the generation process as learning an optimal, direct transport trajectory between the distribution of the current rendering and the desired target distribution, thereby enabling high-quality generation with smaller Classifier-free Guidance (CFG) values. First, we theoretically establish SDS as a simplified instance of the Schrödinger Bridge framework. We prove that SDS employs the reverse process of a Schrödinger Bridge, which, under specific conditions (e.g., Gaussian noise at one end), collapses to the score function of the pre-trained diffusion model used by SDS. Building on this, we introduce Trajectory-Centric Distillation (TraCe), a novel text-to-3D generation framework that reformulates the mathematically tractable Schrödinger Bridge framework to explicitly construct a diffusion bridge from the current rendering to its text-conditioned, denoised target, and trains a LoRA-adapted model on this trajectory's score dynamics for robust 3D optimization. Comprehensive experiments demonstrate that TraCe consistently achieves higher quality and fidelity than state-of-the-art techniques.
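
To make the role of the CFG value concrete, the following is a minimal sketch of how classifier-free guidance typically enters an SDS-style residual. It assumes the common convention in which a scale of 1 recovers the conditional prediction; the function names `cfg_noise` and `sds_residual`, the flat-list representation of noise predictions, and the scalar `weight` are illustrative assumptions and are not taken from the paper.

```python
def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance (assumed convention): start at the unconditional
    prediction and move `scale` times the conditional-minus-unconditional gap."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sds_residual(eps_uncond, eps_cond, eps_true, scale, weight=1.0):
    """Per-element SDS-style residual weight * (eps_hat - eps).

    In an SDS pipeline this residual would be backpropagated through the
    renderer; here it is only computed, not applied. `weight` is a
    hypothetical stand-in for a timestep weighting.
    """
    guided = cfg_noise(eps_uncond, eps_cond, scale)
    return [weight * (g - e) for g, e in zip(guided, eps_true)]
```

Under this convention, raising the scale amplifies the gap between the conditional and unconditional predictions, which is one way to see why very large CFG values can push the optimization toward over-saturated results.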

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: From left to right: (a) Standard VSD wang2023prolificdreamer (CFG = 7.5, CFG: Classifier-free Guidance); (b) Standard SDS poole2022dreamfusion (CFG = 100); (c) VSD wang2023prolificdreamer (CFG = 20); (d) SDS poole2022dreamfusion (CFG = 20); (e) Ours (CFG = 20). VSD yields low-quality results at both CFG = 7.5 and CFG = 20. Standard SDS produces artifacts (e.g., over-smoothing) at high CFG and low-quality results at low CFG. Our method generates high-quality, high-fidelity results at a moderate CFG value.
  • Figure 2: Left: Schrödinger Bridge Visualization and Samples. Top: Probability flow of the bridge from the current rendering ($x_{\mathrm{rndr}}$) distribution to the predicted target ($x_0^{\mathrm{pred}}$) distribution. Bottom: Corresponding image samples, showing the current rendering, intermediate bridge samples ($x_t^i$), and the final predicted target. Right: Gradient and Intermediate Rendering Comparison. The first row shows TraCe gradients, the second shows SDS gradients, and the third shows renderings of the 3D models partway through generation. Note the reduced artifacts and potentially more coherent structure in the TraCe gradients and intermediate renderings.
  • Figure 3: Overview of Trajectory-Centric Distillation (TraCe). TraCe optimizes 3D parameters $\theta$ by computing a distillation gradient with a LoRA-adapted 2D diffusion model, $\epsilon_{\phi}$. Given a text prompt $y$ and camera parameters $c$, (1) the current 3D model is rendered from a random view to produce $x_{\text{rndr}}$. (2) An ideal target view $x_0^{\text{pred}}$ is estimated from $x_{\text{rndr}}$ using a pre-trained diffusion model $\epsilon_{\text{pretrain}}$ via one-step denoising. (3) An intermediate latent $x_t$ is sampled from the analytic bridge posterior $q(x_t \mid x_0^{\text{pred}}, x_{\text{rndr}})$ at time $t$. (4) The LoRA model $\epsilon_{\phi}$ predicts the noise for $x_t$, and the difference between this prediction and the target noise is computed. (5) This difference determines the TraCe gradient $\nabla_{\theta} \mathcal{L}_{\text{TraCe}}$ and drives the update of the LoRA parameters $\phi$.
  • Figure 4: Qualitative comparisons. We present visual examples with the same text prompt.
  • Figure 5: Ablation study on our framework.
  • ...and 1 more figure
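
Step (3) of the Figure 3 caption samples an intermediate latent from an analytic bridge posterior pinned at the predicted target and the current rendering. The sketch below illustrates the idea under a simplifying assumption: a constant-variance Brownian bridge with the target at $t = 0$ and the rendering at $t = 1$. The paper's actual noise schedule and time convention may differ, and the names `sample_bridge`, `sigma`, and `rng` are hypothetical.

```python
import math
import random


def sample_bridge(x_target, x_render, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at both ends (assumed form).

    Endpoints: x_target at t = 0, x_render at t = 1 (an assumed convention).
    Mean:      (1 - t) * x_target + t * x_render
    Std dev:   sigma * sqrt(t * (1 - t)), which vanishes at both endpoints.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = rng or random.Random()
    std = sigma * math.sqrt(t * (1.0 - t))
    # Interpolate each element and add Gaussian noise scaled by the bridge std.
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x_target, x_render)]
```

For example, `sample_bridge(target, render, 0.5, rng=random.Random(0))` draws a noisy point midway between the two images. Because the noise vanishes at both ends, every sample stays on a path that connects the rendering directly to its target, which is the property the direct-trajectory formulation relies on.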