DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

Qihao Liu; Yi Zhang; Song Bai; Adam Kortylewski; Alan Yuille

TL;DR

DIRECT-3D tackles the lack of large-scale, high-quality 3D data by training a diffusion-based 3D generator directly on massive noisy in-the-wild assets. It uses a disentangled tri-plane diffusion pipeline to produce geometry $\mathbf{f}_g$ and color $\mathbf{f}_c$ that decode into NeRFs, with an iterative pose-estimation step to automatically align data. A 3D super-resolution plug-in and coarse-to-fine caption enrichment further enhance high-resolution quality and conditioning. The approach achieves state-of-the-art results for single-class and text-to-3D generation, and provides a useful 3D geometry prior that improves existing 2D-lifting methods, demonstrating practical scalability and impact for 3D content creation.

Abstract

We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets, mitigating the key challenge (i.e., data scarcity) in large-scale 3D generation. In particular, DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically, after an initial warm-up phase using a small set of clean data, an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input, our model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. The code and models are available for research purposes at: https://github.com/qihao067/direct3d.
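
The align-and-filter idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical `toy_loss` (mean squared distance to a reference shape) as a stand-in for the diffusion model's conditional density, restricts pose to a grid search over z-axis rotations, and uses made-up names (`rotate_z`, `align_and_select`, `threshold`). The only point it shows is the control flow: score each noisy sample under several candidate poses, keep the best pose, and discard the sample if even its best score is poor.

```python
import math


def rotate_z(points, angle):
    """Rotate a list of (x, y, z) points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def toy_loss(points, reference):
    """Mean squared distance to `reference`.

    ASSUMPTION: a stand-in for a model-based score; the paper uses
    conditional density under the diffusion model instead.
    """
    total = sum((a - b) ** 2 for p, q in zip(points, reference) for a, b in zip(p, q))
    return total / len(points)


def align_and_select(samples, reference, candidate_angles, threshold):
    """Return (index, best_angle, best_loss) for every sample that passes the filter."""
    kept = []
    for index, points in enumerate(samples):
        # Score the sample under every candidate pose and keep the best one.
        scored = [(toy_loss(rotate_z(points, a), reference), a) for a in candidate_angles]
        best_loss, best_angle = min(scored)
        # Drop samples that stay poorly explained even under their best pose.
        if best_loss <= threshold:
            kept.append((index, best_angle, best_loss))
    return kept


# Hypothetical data: one misaligned copy of the reference and one unrelated sample.
reference = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
misaligned = rotate_z(reference, -math.pi / 2)
junk = [(5.0, 5.0, 5.0), (-4.0, 2.0, 7.0), (3.0, -6.0, 1.0)]
angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
kept = align_and_select([misaligned, junk], reference, angles, threshold=0.01)
```

In this toy setup the misaligned copy is recovered with a quarter-turn correction and the unrelated sample is filtered out. In the described method the score comes from the model itself after the warm-up phase, so no fixed reference shape is involved.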

Paper Structure (29 sections, 7 equations, 14 figures, 5 tables)

Introduction
Related Work
Method
Tri-plane Diffusion for NeRF Generation
Training with Noisy and Unaligned Data
3D Super Resolution
Coarse to Fine-grained Caption Generation
Experiments
Single-class 3D Generation
Direct Text-to-3D Generation
Improving 2D-lifting Methods with 3D Prior
Ablation Studies
Ablation of Automatic Alignment and Cleaning
Ablation of Disentanglement
Ablation of Prompt Enrichment
...and 14 more sections

Figures (14)

Figure 1: Unlike optimization-based 2D-lifting methods such as DreamFusion, DIRECT-3D directly generates 3D content in a single forward pass (a). To mitigate the lack of high-quality 3D data, DIRECT-3D enables efficient end-to-end training of 3D generative models on massive noisy and unaligned 'in-the-wild' 3D assets (b). Once trained, DIRECT-3D can generate high-quality 3D objects with accurate geometric details and various textures in 12 seconds on a single V100, driven by text prompts (c). DIRECT-3D can also be used as an effective 3D geometry prior that significantly alleviates the Janus problem in 2D-lifting methods (d).
Figure 2: Method overview. Given a prompt, we generate a NeRF with two modules. The disentangled tri-plane diffusion module uses 2 diffusion models (or 4 if the super-resolution plug-in is used) to generate the geometry ($\mathbf{f}_g$) and color ($\mathbf{f}_c$) tri-planes separately. Both tri-planes are then reshaped and fed into a NeRF auto-decoder to produce the final outputs. During training, an iterative optimization process is introduced in the geometry diffusion to explicitly model the pose $\theta$ of objects and select beneficial ones, enabling efficient training on noisy 'in-the-wild' data. The whole model is end-to-end trainable (with or without the SR plug-in), with only multi-view 2D images as supervision.
Figure 3: Qualitative comparison with Shap-E. We use the same text prompts as in Shap-E (top 2 rows) and DreamFusion (middle 2 rows), and we also compare performance on complex objects (last row). For Shap-E, we use the official code and model. For our method, we generate objects at $128^3$ without the super-resolution plug-in. All images of both methods are rendered at $256^2$. DIRECT-3D generates 3D objects with better geometry and texture quality, and it also produces more diverse and complex objects.
Figure 4: DIRECT-3D provides a useful 3D prior for 2D-lifting methods such as DreamFusion. Our 3D prior alleviates issues such as multiple faces and missing/extra limbs, while also improving texture quality. See the video results in the supplementary material for a better comparison.
Figure 5: Tri-plane features learned with/without Automatic Alignment and Cleaning (AAC) on Objaverse. AAC roughly aligns the objects, which yields clear tri-plane features. Unaligned objects can still be captured by the tri-plane representation, but the inadequate axis disentanglement makes them challenging for the diffusion model to learn.
...and 9 more figures
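
As a rough illustration of the disentangled tri-plane representation described in the Figure 2 caption, the sketch below queries separate geometry and color tri-planes at one 3D point. It is a minimal sketch under stated assumptions, not the paper's decoder: the function names (`bilinear`, `query_triplane`), the plane keys `"xy"`, `"xz"`, `"yz"`, the `[0, 1]` coordinate range, and the choice to sum the three per-plane features are all illustrative. The NeRF auto-decoder that maps these features to density and color is omitted.

```python
def bilinear(plane, u, v):
    """Bilinearly sample a 2D grid of feature vectors at (u, v) in [0, 1].

    `plane[row][col]` is a list of floats; `u` runs along columns, `v` along rows.
    The grid needs at least 2 rows and 2 columns.
    """
    rows, cols = len(plane), len(plane[0])
    # Map [0, 1] to index space and clamp so the 2x2 neighbourhood stays in bounds.
    x = min(max(u, 0.0), 1.0) * (cols - 1)
    y = min(max(v, 0.0), 1.0) * (rows - 1)
    x0, y0 = min(int(x), cols - 2), min(int(y), rows - 2)
    tx, ty = x - x0, y - y0

    def lerp(a, b, t):
        return [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]

    top = lerp(plane[y0][x0], plane[y0][x0 + 1], tx)
    bottom = lerp(plane[y0 + 1][x0], plane[y0 + 1][x0 + 1], tx)
    return lerp(top, bottom, ty)


def query_triplane(planes, point):
    """Project `point` onto the three axis-aligned planes and sum the sampled features.

    ASSUMPTION: summation is one common aggregation choice, used here for brevity.
    """
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Hypothetical 2x2 planes with one feature channel each.
geometry_planes = {
    "xy": [[[0.0], [1.0]], [[0.0], [1.0]]],  # feature equals the first coordinate
    "xz": [[[0.0], [0.0]], [[1.0], [1.0]]],  # feature equals the second coordinate
    "yz": [[[2.0], [2.0]], [[2.0], [2.0]]],  # constant feature
}
color_planes = {key: [[[0.5], [0.5]], [[0.5], [0.5]]] for key in ("xy", "xz", "yz")}

point = (0.25, 0.9, 0.5)
geometry_feature = query_triplane(geometry_planes, point)
color_feature = query_triplane(color_planes, point)
```

Keeping geometry and color in separate tri-planes means each set can be generated by its own diffusion model, which is the disentanglement the caption describes. The two feature vectors would then be passed to a decoder.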
