BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis

Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Jiazhou Zhou, Lin Wang

TL;DR

This work tackles the inefficiency of per-prompt optimization in text-to-3D synthesis by proposing BrightDreamer, a fast, end-to-end feed-forward model that generates 3D Gaussian Splatting representations and generalizes to unseen prompts. It decomposes generation into a Text-guided Shape Deformation (TSD) network that places the Gaussian centers and a Text-guided Triplane Generator (TTG) that builds spatial features, coupled with a lightweight Gaussian Decoder that produces the remaining Gaussian attributes. The whole model is trained with Score Distillation Sampling. The approach generates an object in about 77 ms and renders at 705 FPS, demonstrates strong semantic understanding, and compares favorably with baselines in generalization and transferability, including rapid post-generation finetuning. By leveraging anchor-based deformation and a refined triplane-based representation, BrightDreamer offers a practical path toward instant, text-driven 3D asset creation without large 3D datasets.

Abstract

Text-to-3D synthesis has recently seen intriguing advances by combining text-to-image priors with 3D representation methods, e.g., 3D Gaussian Splatting (3D GS), via Score Distillation Sampling (SDS). However, a hurdle of existing methods is their low efficiency: each requires per-prompt optimization for a single 3D object. A paradigm shift from per-prompt optimization to feed-forward generation for any unseen text prompt is therefore imperative, yet it remains challenging. One obstacle is how to directly generate a set of millions of 3D Gaussians to represent a 3D object. This paper presents BrightDreamer, an end-to-end feed-forward approach that achieves generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating the 3D deformation from an anchor shape with predefined positions. For this, we first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions, which are used as the centers (one attribute) of the 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH), we then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object. The center of each Gaussian enables us to transform the spatial feature into the four attributes. The generated 3D Gaussians can finally be rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. BrightDreamer also possesses a strong semantic understanding capability, even for complex text prompts. The code is available on the project page.
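The data flow described in the abstract (anchor positions, predicted deformation, triplane lookup at each center, decoding into the remaining attributes) can be illustrated with a toy sketch. This is not the authors' implementation: random numbers stand in for the outputs of the TSD and TTG networks, and the grid size, plane resolution, channel layout, summation of the three plane features, and every function name are assumptions made for illustration.

```python
# Illustrative sketch only. Random values replace the learned TSD and TTG
# networks; all names, shapes, and constants are hypothetical.
import math
import random


def make_anchor_grid(n):
    """Predefined anchor positions: an n*n*n grid in [-1, 1]^3."""
    ticks = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(x, y, z) for x in ticks for y in ticks for z in ticks]


def clamp(value, low=-1.0, high=1.0):
    return max(low, min(high, value))


def deform(anchors, offsets):
    """Gaussian center = anchor position + predicted offset, kept inside the cube."""
    return [tuple(clamp(a + o) for a, o in zip(anchor, offset))
            for anchor, offset in zip(anchors, offsets)]


def bilinear(plane, u, v):
    """Bilinearly sample a res x res x C feature plane at (u, v) in [-1, 1]^2."""
    res = len(plane)
    fu = (u + 1.0) * 0.5 * (res - 1)
    fv = (v + 1.0) * 0.5 * (res - 1)
    i0, j0 = min(int(fu), res - 2), min(int(fv), res - 2)
    du, dv = fu - i0, fv - j0
    out = []
    for c in range(len(plane[0][0])):
        near = plane[i0][j0][c] * (1 - dv) + plane[i0][j0 + 1][c] * dv
        far = plane[i0 + 1][j0][c] * (1 - dv) + plane[i0 + 1][j0 + 1][c] * dv
        out.append(near * (1 - du) + far * du)
    return out


def sample_triplane(planes, center):
    """Project the center onto the xy, xz, and yz planes and sum the features
    (summation is an assumption; concatenation is another common choice)."""
    x, y, z = center
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def decode(feature):
    """Toy stand-in for a learned decoder: split 11 channels into four attributes."""
    quat = feature[3:7]
    norm = math.sqrt(sum(q * q for q in quat)) or 1.0
    return {
        "scaling": [math.exp(v) for v in feature[0:3]],   # positive scales
        "rotation": [q / norm for q in quat],             # unit quaternion
        "opacity": 1.0 / (1.0 + math.exp(-feature[7])),   # sigmoid to (0, 1)
        "sh": feature[8:11],                              # degree-0 SH (color)
    }


def generate(n=4, res=8, channels=11, seed=0):
    rng = random.Random(seed)
    anchors = make_anchor_grid(n)
    # Stand-in for the TSD output: one small offset per anchor.
    offsets = [tuple(rng.uniform(-0.1, 0.1) for _ in range(3)) for _ in anchors]
    # Stand-in for the TTG output: three feature planes.
    planes = {name: [[[rng.uniform(-1.0, 1.0) for _ in range(channels)]
                      for _ in range(res)] for _ in range(res)]
              for name in ("xy", "xz", "yz")}
    gaussians = []
    for center in deform(anchors, offsets):
        gaussian = decode(sample_triplane(planes, center))
        gaussian["center"] = center
        gaussians.append(gaussian)
    return gaussians


gaussians = generate()
print(len(gaussians), "Gaussians; first opacity:", round(gaussians[0]["opacity"], 3))
```

The point of the sketch is the dependency order: centers are fixed first by deforming the anchors, and only then are the other four attributes read off the triplane at those centers.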

Paper Structure (23 sections, 6 equations, 51 figures, 2 tables, 2 algorithms)

This paper contains 23 sections, 6 equations, 51 figures, 2 tables, 2 algorithms.

Figures (51)

  • Figure 1: A comparison between per-prompt optimization-based methods and our feed-forward generation-based approach with an end-to-end objective. (a) Optimization-based methods directly initialize a 3D representation model, e.g., 3D Gaussian Splatting (GS). This process usually suffers from slow per-sample optimization (e.g., several hours for a single text). (b) By contrast, once trained, our approach directly generates 3D content for any unseen text prompt in 77 ms with a single feed-forward pass of our generator.
  • Figure 2: DreamGaussian [tang2023dreamgaussian] and LucidDreamer [liang2023luciddreamer] are both optimized for a single text prompt, whereas our results are generated directly. To demonstrate generalization, none of the prompts appear in our training set. (a) shows the understanding of complex text. (b) demonstrates our capability of understanding details; notably, light purple, deep purple, and light yellow do not appear in the training set. (c) Interpolation between two prompts from color and shape perspectives.
  • Figure 3: An overview of BrightDreamer. The details of the Spatial Transformer, ResConv Block, and Upsample Block are shown in Fig. 5.
  • Figure 4: Visualization of expanding a 2D convolution kernel (blue area) to 3D, and of how it moves, in the previous convolutional triplane generator [chan2022efficient]. We use a $1 \times 1$ convolutional kernel as an example. Only a few positions interact, which leads to spatial inhomogeneity.
  • Figure 5: A detailed illustration of specific blocks. (a) Spatial Transformer Block. (b) Residual Convolutional Block. (c) Upsample Block.
  • ...and 46 more figures