Turbo3D: Ultra-fast Text-to-3D Generation
Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang
TL;DR
Turbo3D tackles ultra-fast, text-conditioned 3D generation by distilling a multi-view diffusion model into a compact 4-step, 4-view generator and reconstructing 3D Gaussian splatting assets directly from its latent outputs. The core contributions are Dual-Teacher Distillation, which preserves both multi-view consistency and photorealism, and Latent GS-LRM, which accelerates multi-view reconstruction by operating on latents instead of decoded images. Experiments on Objaverse show sub-second inference on an A100 GPU with competitive CLIP and VQA alignment scores, and strong performance at 512 resolution with additional speedups. Overall, Turbo3D narrows the gap between the speed of 2D diffusion and the quality of 3D assets, enabling near-real-time text-to-3D in interactive pipelines.
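The inference flow summarized above can be illustrated with a minimal, hypothetical sketch. It is not the authors' code. The names `generate_3d`, `toy_denoise` and `toy_reconstruct` are illustrative, latents are flat lists of floats standing in for tensors, and the toy functions stand in for the distilled student and Latent GS-LRM. The sketch assumes the student denoises all four views jointly in one call per step, as multi-view diffusion models typically do. It shows the two properties that keep the pipeline fast: only four student calls, and no image-decoding step before reconstruction.

```python
import random
from typing import Callable, List

Latent = List[float]  # flattened stand-in for one view's latent tensor


def generate_3d(
    prompt: str,
    denoise: Callable[[List[Latent], int, str], List[Latent]],
    reconstruct: Callable[[List[Latent]], list],
    num_views: int = 4,
    num_steps: int = 4,
    latent_size: int = 16,
    seed: int = 0,
) -> list:
    """Hypothetical few-step, multi-view text-to-3D loop in latent space."""
    rng = random.Random(seed)
    # One Gaussian noise latent per view.
    latents = [[rng.gauss(0.0, 1.0) for _ in range(latent_size)]
               for _ in range(num_views)]
    # Few-step sampling: one student call per step updates all views jointly
    # (any re-noising between steps is folded into `denoise` in this sketch).
    for step in range(num_steps):
        latents = denoise(latents, step, prompt)
    # No VAE decode to pixels: the reconstructor consumes the latents directly.
    return reconstruct(latents)


# Usage with toy stand-ins that count how often each network is invoked.
calls = {"denoise": 0, "reconstruct": 0}


def toy_denoise(latents: List[Latent], step: int, prompt: str) -> List[Latent]:
    calls["denoise"] += 1
    # Placeholder update: shrink every value toward zero.
    return [[0.5 * x for x in view] for view in latents]


def toy_reconstruct(latents: List[Latent]) -> list:
    calls["reconstruct"] += 1
    # Placeholder: pretend each latent element yields one Gaussian.
    return [{"mean": x} for view in latents for x in view]


gaussians = generate_3d("a red teapot", toy_denoise, toy_reconstruct)
print(calls, len(gaussians))
```

With four views of 16 latent values each, the toy reconstructor returns 64 placeholder Gaussians after exactly four student calls and one reconstructor call.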
Abstract
We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the time spent decoding latents into images and halve the transformer sequence length for maximum efficiency. Our method produces better 3D generation results than previous baselines while running in a fraction of their runtime.
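The Dual-Teacher idea can be illustrated with a toy objective. In this hypothetical sketch, the multi-view teacher scores the student's views jointly (a consistency signal) and the single-view teacher scores each view independently (a realism signal). The weighted sum, the weights `w_mv` and `w_sv`, and the per-view averaging are assumptions. The abstract does not say how the two signals are combined, and the toy losses below are placeholders rather than the distillation objectives used in Turbo3D.

```python
from typing import Callable, List, Sequence

View = List[float]  # flattened stand-in for one generated view


def dual_teacher_loss(
    student_views: Sequence[View],
    mv_teacher_loss: Callable[[Sequence[View]], float],
    sv_teacher_loss: Callable[[View], float],
    w_mv: float = 1.0,
    w_sv: float = 1.0,
) -> float:
    """Hypothetical combination of a multi-view and a single-view teacher signal."""
    # Multi-view teacher sees all views together -> view-consistency signal.
    l_mv = mv_teacher_loss(student_views)
    # Single-view teacher sees one image at a time -> per-view realism, averaged.
    l_sv = sum(sv_teacher_loss(v) for v in student_views) / len(student_views)
    return w_mv * l_mv + w_sv * l_sv


def toy_mv_loss(views: Sequence[View]) -> float:
    # Placeholder consistency term: variance of the per-view means.
    means = [sum(v) / len(v) for v in views]
    centre = sum(means) / len(means)
    return sum((m - centre) ** 2 for m in means) / len(means)


def toy_sv_loss(view: View, target: float = 2.0) -> float:
    # Placeholder realism term: mean squared distance to a "realistic" value.
    return sum((x - target) ** 2 for x in view) / len(view)


views = [[1.0, 1.0], [3.0, 3.0]]
loss = dual_teacher_loss(views, toy_mv_loss, toy_sv_loss, w_mv=1.0, w_sv=0.5)
print(loss)
```

For the inconsistent views above, both toy terms equal 1.0, so the weighted loss is 1.0 + 0.5 × 1.0 = 1.5. Identical views at the target value give zero under both terms. Keeping the two terms separate lets each teacher contribute the property it is suited to judge.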
