Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation
Ziying Li, Xuequan Lu, Xinkui Zhao, Guanjie Cheng, Shuiguang Deng, Jianwei Yin
TL;DR
This work addresses the artifacts and reliability issues in SDS-based text-to-3D generation by reframing Score Distillation Sampling (SDS) as a special case of the Schrödinger Bridge and introducing Trajectory-Centric Distillation (TraCe). TraCe explicitly constructs a diffusion bridge from the current 3D render to a text-conditioned target and trains a LoRA-adapted diffusion model to learn the bridge's score dynamics, enabling robust optimization at lower Classifier-free Guidance (CFG) values. The approach yields higher-fidelity 3D assets with better semantic alignment and texture detail, and it remains robust across CFG values, with competitive performance at moderate CFG. Overall, TraCe provides a principled, direct-trajectory optimization framework for text-to-3D generation with practical improvements in quality, efficiency, and stability.
Abstract
Recent optimization-based text-to-3D generation relies heavily on distilling knowledge from pre-trained text-to-image diffusion models with techniques such as Score Distillation Sampling (SDS), which often introduce artifacts such as over-saturation and over-smoothing into the generated 3D assets. In this paper, we address this core problem by formulating generation as learning an optimal, direct transport trajectory between the distribution of the current rendering and the desired target distribution, thereby enabling high-quality generation with smaller Classifier-free Guidance (CFG) values. First, we theoretically establish SDS as a simplified instance of the Schrödinger Bridge framework. We prove that SDS employs the reverse process of a Schrödinger Bridge, which, under specific conditions (e.g., Gaussian noise at one end), collapses to the score function of the pre-trained diffusion model used by SDS. Building on this, we introduce Trajectory-Centric Distillation (TraCe), a novel text-to-3D generation framework that reformulates the mathematically tractable Schrödinger Bridge framework to explicitly construct a diffusion bridge from the current rendering to its text-conditioned, denoised target, and trains a LoRA-adapted model on this trajectory's score dynamics for robust 3D optimization. Comprehensive experiments demonstrate that TraCe consistently achieves higher quality and fidelity than state-of-the-art techniques.
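As a rough illustration of the two ingredients the abstract contrasts, the sketch below shows the standard CFG combination of noise predictions with an SDS-style residual, alongside a sample from a Brownian bridge pinned at a render and a target. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`cfg_noise`, `sds_residual`, `bridge_sample`), the time convention (t = 0 at the render, t = 1 at the target), and the constant bridge noise scale `sigma` are all hypothetical choices made here, and the noise predictions are passed in as plain arrays rather than produced by a diffusion model.

```python
import numpy as np


def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the text-conditioned one by a factor `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def sds_residual(eps_uncond, eps_cond, eps, scale, weight=1.0):
    """SDS-style signal: weighted difference between the guided noise
    prediction and the noise `eps` that was actually injected."""
    return weight * (cfg_noise(eps_uncond, eps_cond, scale) - eps)


def bridge_sample(x_render, x_target, t, sigma, rng):
    """Sample the time-t marginal of a Brownian bridge pinned at x_render
    (t = 0) and x_target (t = 1). The constant `sigma` is an assumption
    of this sketch, not a schedule taken from the paper."""
    mean = (1.0 - t) * x_render + t * x_target
    std = sigma * np.sqrt(t * (1.0 - t))  # zero at both endpoints
    return mean + std * rng.standard_normal(x_render.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_render = np.zeros((4, 4))  # stand-in for the current rendering
    x_target = np.ones((4, 4))   # stand-in for the denoised target
    x_mid = bridge_sample(x_render, x_target, t=0.5, sigma=0.1, rng=rng)
    print(x_mid.shape)
```

The point of the contrast is that the SDS residual is measured against pure Gaussian noise, whereas the bridge sample stays between two image-like endpoints and its noise vanishes at t = 0 and t = 1.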
