Turbo3D: Ultra-fast Text-to-3D Generation

Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang

TL;DR

Turbo3D tackles ultra-fast, text-conditioned 3D generation by distilling a multi-view (MV) diffusion model into a compact few-step generator and reconstructing 3D assets from latent representations. The core contributions are Dual-Teacher Distillation, which preserves multi-view consistency and photorealism, and Latent GS-LRM, which accelerates MV reconstruction. Experiments on Objaverse demonstrate sub-second inference on a single A100 GPU with competitive CLIP and VQA alignment, and strong performance at 512 resolution with additional speedups. Overall, Turbo3D narrows the gap between 2D diffusion speed and 3D asset quality, enabling real-time text-to-3D in interactive pipelines.

Abstract

We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of our Turbo3D text-to-3D system. Turbo3D generates high-quality 3D Gaussian Splatting (3DGS) assets from user prompts in less than 1 second on a single A100 GPU. It is a two-stage pipeline consisting of a highly efficient latent-space few-step multi-view (MV) generator and a single-step MV reconstructor. Note that we visualize latents as RGB images and 3DGS assets as point clouds in the pipeline figure for clarity.
  • Figure 2: Dual-teacher distillation framework in our Turbo3D. Note that latents are visualized as RGB images for clarity. We aim to distill a multi-step multi-view teacher generator (right, green) into a few-step multi-view generator (left, blue). Our few-step MV student generator is conditioned on Plücker embeddings for better 3D awareness. Similar to yin2024improved, we optimize the student generator using a distribution matching objective (DMD loss) and train the fake score function to model the distribution of samples produced by the student generator. In particular, we integrate two teacher models, a multi-view teacher and a single-view (SV) teacher, to enhance both multi-view consistency and photorealism. The MV score functions take a set of images of one object as input and calculate the MV DMD loss, while the SV score functions treat each image separately and calculate the SV DMD loss.
  • Figure 3: We compare the renderings of pixel GS-LRM and latent GS-LRM. Latent GS-LRM achieves reconstruction quality comparable to that of pixel GS-LRM.
  • Figure 4: Comparison of our Turbo3D against the baselines LGM tang2025lgm and Instant3D li2023instant3d. Among these methods, ours generates the most detailed and physically plausible 3D assets, closely adhering to the provided text prompts. In contrast, LGM tends to generate broken assets with the Janus issue poole2022dreamfusion, while Instant3D has poorer text alignment, often missing some concepts, e.g., 'spilling out' in the first row and 'river' in the second row.
  • Figure 5: User study results comparing our Turbo3D to the baselines LGM tang2025lgm and Instant3D li2023instant3d, and to our slow MV teacher. Turbo3D is consistently preferred over LGM and Instant3D, and is preferred on par with our MV teacher. See Fig. \ref{fig:comparison} and Fig. \ref{fig:ablation} for visual comparison.
  • ...and 2 more figures
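
The dual-teacher objective described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes the usual distribution-matching form in which the update direction for a generated sample is the fake score minus the real (teacher) score, and it simply adds an MV term computed on the whole view set to an SV term computed one view at a time. The names `gaussian_score`, `dmd_direction`, `dual_teacher_direction`, and the weight `sv_weight` are hypothetical, the toy Gaussian scores stand in for diffusion models, and noising, timestep weighting, and training of the fake score functions are all omitted.

```python
from typing import Callable, List

View = List[float]       # one view's latent, flattened (toy stand-in)
ViewSet = List[View]     # all views of one object
Score = Callable[[ViewSet], ViewSet]


def gaussian_score(mean: float) -> Score:
    """Toy score of a unit-variance Gaussian centred at `mean`: -(x - mean)."""
    def score(views: ViewSet) -> ViewSet:
        return [[-(x - mean) for x in view] for view in views]
    return score


def dmd_direction(real: Score, fake: Score, views: ViewSet) -> ViewSet:
    """Elementwise fake score minus real score for the given views."""
    real_out, fake_out = real(views), fake(views)
    return [[f - r for r, f in zip(r_row, f_row)]
            for r_row, f_row in zip(real_out, fake_out)]


def dual_teacher_direction(views: ViewSet,
                           mv_real: Score, mv_fake: Score,
                           sv_real: Score, sv_fake: Score,
                           sv_weight: float = 1.0) -> ViewSet:
    """Combine an MV term (whole set at once) with an SV term (per view).

    `sv_weight` is an assumed balancing factor, not a value from the paper.
    """
    # MV score functions see every view of the object together.
    mv_term = dmd_direction(mv_real, mv_fake, views)
    # SV score functions are called on each view separately.
    sv_term = [dmd_direction(sv_real, sv_fake, [view])[0] for view in views]
    return [[m + sv_weight * s for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(mv_term, sv_term)]


# Four views, three latent values each, all starting at zero.
views = [[0.0] * 3 for _ in range(4)]
direction = dual_teacher_direction(
    views,
    mv_real=gaussian_score(0.0), mv_fake=gaussian_score(1.0),
    sv_real=gaussian_score(0.0), sv_fake=gaussian_score(2.0),
    sv_weight=0.5,
)
```

Subtracting a small multiple of `direction` from the sample moves it away from the fake distributions and toward the teacher distributions. Because the toy scores act elementwise, the sketch shows only the call pattern (set-level versus per-view), not the cross-view coupling that a real multi-view teacher would provide.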