
Instant3D: Instant Text-to-3D Generation

Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, Xiangyu Xu

TL;DR

Instant3D tackles the bottleneck of slow text-to-3D generation by learning a single text-conditioned NeRF that outputs a triplane in a forward pass. It achieves faithful text alignment and strong multi-view consistency by integrating cross-attention, style injection, and token-to-plane transformation, together with a scaled-sigmoid activation and an adaptive Perp-Neg algorithm to address weak supervision and the Janus problem. The model is trained with SDS and CLIP-based losses without 3D supervision and demonstrates competitive or superior quality while drastically speeding up inference (sub-second) across diverse prompt sets. This approach enables practical, real-time text-to-3D generation for applications in film, AR/VR, and interactive media, without requiring large 3D datasets.

Abstract

Text-to-3D generation has attracted much attention from the computer vision community. Existing methods mainly optimize a neural field from scratch for each text prompt, relying on heavy and repetitive training cost which impedes their practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of our Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. In particular, we propose to combine three key mechanisms: cross-attention, style injection, and token-to-plane transformation, which collectively ensure precise alignment of the output with the input text. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid function, which speeds up the training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that can dynamically adjust its concept negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The code, data, and models are available at https://github.com/ming1993li/Instant3DCodes.
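
The abstract does not spell out how the adaptive Perp-Neg algorithm sets its negation scales, so the following is only a minimal sketch of the underlying perpendicular-negation idea: keep the part of a negative-prompt direction that is orthogonal to the positive-prompt direction, and subtract it with some scale. All function names and the fixed `neg_scale` parameter are hypothetical; the adaptive scheduling described in the paper is not reproduced here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def perpendicular_component(neg, pos):
    """Part of `neg` that is orthogonal to `pos` (Gram-Schmidt step)."""
    denom = dot(pos, pos)
    if denom == 0.0:
        # No positive direction to project onto: nothing is removed.
        return list(neg)
    coef = dot(neg, pos) / denom
    return [n - coef * p for n, p in zip(neg, pos)]


def perp_neg_direction(pos, neg, neg_scale):
    """Combine a positive direction with a scaled, orthogonalized negative one.

    `neg_scale` is a plain constant in this sketch (an assumption); an adaptive
    variant would vary it during training.
    """
    perp = perpendicular_component(neg, pos)
    return [p - neg_scale * q for p, q in zip(pos, perp)]


# Toy 2-D example: the overlap of `neg` with `pos` is discarded before negation.
pos = [1.0, 0.0]
neg = [1.0, 1.0]
combined = perp_neg_direction(pos, neg, neg_scale=0.5)
```

Because only the orthogonal part of the negative direction is subtracted, the negation cannot cancel the positive direction itself, which is the property that makes this kind of negation attractive for suppressing unwanted views.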

Paper Structure

This paper contains 22 sections, 13 equations, 22 figures, 2 tables.

Figures (22)

  • Figure 1: Comparison of existing SOTA methods with our proposal. Existing methods optimize a randomly initialized NeRF from scratch for each text prompt, which usually requires more than 10,000 iterations and takes hours to converge. In contrast, our approach learns a text-conditioned NeRF, which requires far less training computation and generalizes well to new prompts. It generates a conditional 3D representation (triplane) of a 3D object for an unseen text prompt in a single pass of a feedforward network, taking about 25 ms.
  • Figure 2: Overview of the proposed Instant3D, which applies a conditional decoder network to map a text prompt to a corresponding triplane. Three condition mechanisms, i.e., cross-attention, style injection, and token-to-plane transformation, are seamlessly combined to bridge text and 3D, tackling the issue of weak supervision from SDS (Poole et al., 2022). Given a random camera pose, a 2D image is rendered from the conditioned triplane through coordinate-based feature sampling, point density and albedo prediction, and differentiable volume rendering. For albedo activation, we propose a scaled-sigmoid function, effectively accelerating the training convergence. During training, the view image is first diffused by adding random noise and then fed into a pretrained UNet conditioned on the text prompt for denoising, which provides the gradient of the SDS loss. We also present an adaptive Perp-Neg algorithm to better solve the Janus problem in our framework. During inference, our Instant3D can infer a faithful 3D object from an unseen text prompt in less than one second.
  • Figure 3: Integrating three condition mechanisms produces high-quality results faithful to the text prompt. The prompt here is "a teddy bear sitting in a basket and wearing a scarf and wearing a baseball cap".
  • Figure 4: Comparison of the original sigmoid function with its stretched variant (left) and the proposed scaled-sigmoid function (right). With $\alpha<1.0$, the scaled-sigmoid function possesses a broader high-gradient region, which accelerates training.
  • Figure 5: Qualitative comparison with the state-of-the-art methods including TextMesh (Tsalicoglou et al., 2023), SJC (Wang et al., 2023), DreamFusion (Poole et al., 2022), Latent-NeRF (Metzer et al., 2022), ProlificDreamer (Wang et al., 2023), and Point-E (Nichol et al., 2022) on the Animals dataset. The proposed Instant3D achieves higher-quality results while being much more efficient than the baselines.
  • ...and 17 more figures
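
The Figure 4 caption says that with $\alpha<1.0$ the scaled-sigmoid has a broader high-gradient region. The sketch below illustrates that claim numerically under one assumption that the caption does not state: that the scaled-sigmoid has the form $1/(1+e^{-\alpha x})$. The helper names and the "half of peak gradient" threshold are illustrative choices, not taken from the paper.

```python
import math


def scaled_sigmoid(x: float, alpha: float = 1.0) -> float:
    """Assumed form of the scaled-sigmoid: sigmoid applied to alpha * x."""
    return 1.0 / (1.0 + math.exp(-alpha * x))


def scaled_sigmoid_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of the assumed scaled-sigmoid with respect to x."""
    s = scaled_sigmoid(x, alpha)
    return alpha * s * (1.0 - s)


def half_peak_width(alpha: float, lo: float = -50.0, hi: float = 50.0,
                    steps: int = 100001) -> float:
    """Width of the interval where the gradient is at least half its peak.

    The peak is at x = 0; the interval is found by a simple grid scan.
    """
    peak = scaled_sigmoid_grad(0.0, alpha)
    dx = (hi - lo) / (steps - 1)
    inside = [
        lo + i * dx
        for i in range(steps)
        if scaled_sigmoid_grad(lo + i * dx, alpha) >= 0.5 * peak
    ]
    return inside[-1] - inside[0]


# Compare the standard sigmoid (alpha = 1) with a scaled one (alpha = 0.5).
width_standard = half_peak_width(1.0)
width_scaled = half_peak_width(0.5)
```

Under this assumed form, the interval in which the gradient stays within a fixed fraction of its peak widens by a factor of $1/\alpha$, so pre-activations far from zero still receive a useful training signal, at the price of a lower peak gradient ($\alpha/4$ instead of $1/4$).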