BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis
Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Jiazhou Zhou, Lin Wang
TL;DR
This work tackles the inefficiency of per-prompt optimization in text-to-3D synthesis by proposing BrightDreamer, a fast, end-to-end feed-forward model that generates 3D Gaussian Splatting representations and generalizes to unseen prompts. It decomposes generation into a Text-guided Shape Deformation (TSD) network that places the Gaussian centers and a Text-guided Triplane Generator (TTG) that builds spatial features, coupled with a lightweight Gaussian Decoder that produces the remaining Gaussian attributes. The model is trained with Score Distillation Sampling. The approach generates an object in about 77 ms and renders it at 705 FPS, demonstrates strong semantic understanding, and compares favorably with baselines in generalization and transferability, including rapid post-generation finetuning. By combining anchor-based deformation with a refined triplane-based representation, BrightDreamer offers a practical path toward instant, text-driven 3D asset creation without large 3D datasets.
Abstract
Text-to-3D synthesis has recently seen intriguing advances by combining text-to-image priors with 3D representation methods, e.g., 3D Gaussian Splatting (3D GS), via Score Distillation Sampling (SDS). However, existing methods are inefficient because they require per-prompt optimization for every single 3D object. A paradigm shift from per-prompt optimization to feed-forward generation for any unseen text prompt is therefore imperative, yet it remains challenging. One obstacle is how to directly generate a set of millions of 3D Gaussians to represent a 3D object. This paper presents BrightDreamer, an end-to-end feed-forward approach that achieves generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating the 3D deformation from an anchor shape with predefined positions. For this, we first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions, which are used as the centers (one attribute) of the 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH), we then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object. The center of each Gaussian is used to obtain its spatial feature from the triplane, and this feature is then transformed into the four attributes. The generated 3D Gaussians can finally be rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. BrightDreamer also possesses a strong semantic understanding capability, even for complex text prompts. The code is available on the project page.
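
The data flow described in the abstract (anchor positions, deformed centers, triplane lookup, attribute decoding) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random arrays stand in for the outputs of the TSD and TTG networks, and the function names, grid sizes, bounded tanh offset, summation over the three planes, single linear decoder, and output activations are all assumptions chosen only to make the shapes concrete.

```python
import numpy as np

def make_anchor_grid(n):
    """Anchor positions on a regular n x n x n grid in [-1, 1]^3 (assumed layout)."""
    axis = np.linspace(-1.0, 1.0, n)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

def deform_anchors(anchors, offsets, max_shift):
    """Shift each anchor by a bounded offset (tanh bound is an assumption)."""
    return np.clip(anchors + max_shift * np.tanh(offsets), -1.0, 1.0)

def sample_plane(plane, u, v):
    """Bilinearly sample an (R, R, C) feature plane at u, v in [-1, 1]."""
    res = plane.shape[0]
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, res - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, res - 2)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    low = plane[x0, y0] * (1 - wx) + plane[x0 + 1, y0] * wx
    high = plane[x0, y0 + 1] * (1 - wx) + plane[x0 + 1, y0 + 1] * wx
    return low * (1 - wy) + high * wy

def query_triplane(planes, centers):
    """Sum the XY, XZ, and YZ plane features at each center (sum is an assumption)."""
    x, y, z = centers[:, 0], centers[:, 1], centers[:, 2]
    return (sample_plane(planes[0], x, y)
            + sample_plane(planes[1], x, z)
            + sample_plane(planes[2], y, z))

def decode_attributes(features, weight, bias):
    """Toy linear decoder: scaling (3), rotation (4), opacity (1), SH (3, degree 0)."""
    out = features @ weight + bias
    scaling = np.exp(out[:, 0:3])                       # keep scales positive
    rot = out[:, 3:7]
    rotation = rot / (np.linalg.norm(rot, axis=1, keepdims=True) + 1e-8)  # unit quaternion
    opacity = 1.0 / (1.0 + np.exp(-out[:, 7:8]))        # sigmoid into (0, 1)
    sh = out[:, 8:11]
    return scaling, rotation, opacity, sh

rng = np.random.default_rng(0)
anchors = make_anchor_grid(8)                           # 8^3 = 512 anchors
offsets = rng.normal(size=anchors.shape)                # stand-in for the TSD output
centers = deform_anchors(anchors, offsets, max_shift=0.1)
planes = rng.normal(size=(3, 16, 16, 32))               # stand-in for the TTG output
features = query_triplane(planes, centers)
weight = rng.normal(scale=0.1, size=(32, 11))           # stand-in for decoder weights
bias = np.zeros(11)
scaling, rotation, opacity, sh = decode_attributes(features, weight, bias)
print(centers.shape, features.shape, scaling.shape, rotation.shape, opacity.shape, sh.shape)
```

Every Gaussian here starts from a fixed anchor, so the number of Gaussians is set by the anchor grid rather than predicted, and each center serves both as the Gaussian's position and as the query point into the shared triplane.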
