3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu

TL;DR

This work addresses the challenge of generating high-quality 3D assets from natural language by combining fast feed-forward diffusion-based generation with a high-fidelity refinement stage. It introduces 3DTopia, a two-stage pipeline that uses a tri-plane latent diffusion model for rapid coarse 3D generation and a hybrid SDS-based refinement leveraging both latent-space and pixel-space 2D diffusion priors to produce detailed textures. A large-scale data curation pipeline (3DTopia-360K) based on Objaverse captions and LLM-based processing provides rich training data. Empirical results show improvements over baselines in texture fidelity and CLIP-based alignment, highlighting the practical potential for rapid, controllable text-to-3D asset creation.
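For orientation, the SDS refinement mentioned above can be written in the standard Score Distillation Sampling form introduced by DreamFusion. This is a sketch of the general technique, not necessarily the exact objective used in 3DTopia. The symbols are assumptions for illustration: $\theta$ are the 3D parameters being refined, $g$ is a differentiable renderer, $y$ is the text prompt, $\hat{\epsilon}_\phi$ is the frozen 2D diffusion model's noise prediction, and $w(t)$ is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta),
\qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```

Under this reading, a latent-space prior would take $x$ to be the encoded latent of the rendered image, while a pixel-space prior would take $x$ to be the rendered image itself.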

Abstract

We present 3DTopia, a two-stage text-to-3D generation system that generates high-quality general 3D assets within 5 minutes using hybrid diffusion priors. The first stage samples from a 3D diffusion prior directly learned from 3D data. Specifically, it is powered by a text-conditioned tri-plane latent diffusion model, which quickly generates coarse 3D samples for fast prototyping. The second stage utilizes 2D diffusion priors to further refine the texture of coarse 3D models from the first stage. The refinement consists of both latent-space and pixel-space optimization for high-quality texture generation. To facilitate the training of the proposed system, we clean and caption the largest open-source 3D dataset, Objaverse, by combining the power of vision-language models and large language models. Experimental results are reported qualitatively and quantitatively to show the performance of the proposed system. Our code and models are available at https://github.com/3DTopia/3DTopia
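To make the tri-plane representation concrete, here is a minimal NumPy sketch of how a feature can be queried for a 3D point from three axis-aligned feature planes. It illustrates the general tri-plane idea only. The function names, the plane keys, the normalized coordinate range, and aggregation by summation are assumptions for illustration, and the sketch leaves out the learned components (the VAE, the diffusion model, and the decoder that would turn features into density and color).

```python
import numpy as np


def bilinear_sample(plane: np.ndarray, u: float, v: float) -> np.ndarray:
    """Bilinearly sample an (H, W, C) feature plane at normalized (u, v) in [-1, 1]."""
    h, w, _ = plane.shape
    u = float(np.clip(u, -1.0, 1.0))
    v = float(np.clip(v, -1.0, 1.0))
    # Map [-1, 1] to continuous pixel coordinates [0, W-1] and [0, H-1].
    x = (u + 1.0) * 0.5 * (w - 1)
    y = (v + 1.0) * 0.5 * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1.0 - wx) * plane[y0, x0] + wx * plane[y0, x1]
    bottom = (1.0 - wx) * plane[y1, x0] + wx * plane[y1, x1]
    return (1.0 - wy) * top + wy * bottom


def query_triplane(planes: dict, point: tuple) -> np.ndarray:
    """Project a point in [-1, 1]^3 onto the xy, xz, and yz planes and sum the features.

    Summation is an assumption made for this sketch; concatenation is another
    common way to combine the three per-plane features.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz
```

In a full system the returned feature vector would be passed to a small decoder network to obtain density and color for volume rendering; that part is omitted here.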

Paper Structure

This paper contains 22 sections, 12 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Architecture comparison of different 3D generation paradigms. We combine the advantages of feed-forward networks and optimization-based methods and propose a two-stage generation system.
  • Figure 2: Overview of our text-to-3D generation system. We propose a two-stage generation system. The first stage is a text-guided latent diffusion model (B.1 and B.2). To train the diffusion model, we first prepare a large text-3D paired dataset. We utilize a large web-crawled 3D dataset, Objaverse (Deitke et al., 2023), and render multi-view images from 3D assets (A.1). Then, multiple large language models are used for captioning and data cleaning. We choose to use tri-planes to parameterize 3D models (A.2). For the second stage, we use Score Distillation Sampling (SDS) for mesh refinement (C). It consists of two steps, i.e., latent-space refinement and pixel-space refinement.
  • Figure 3: 3D captioning pipeline. We first use LLaVA to generate raw captions for multi-view renderings, which are then simplified by Vicuna. Then we use GPT-3.5 to aggregate the multi-view captions into a single caption.
  • Figure 4: Examples of our 3D captions and comparison with Cap3D. Red text marks incorrect parts; green text marks correct parts. Compared with Cap3D captions, we provide longer captions with more details.
  • Figure 5: Tri-plane fitting and VAE results. We show ground truth multi-view images in the first row. The second row is the tri-plane fitting renderings. The third row shows the tri-plane VAE reconstruction results. We achieve a high compression rate while maintaining decent reconstruction quality.
  • ...and 4 more figures
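
The three-step captioning flow described in Figure 3 (per-view captioning, simplification, aggregation) can be sketched as a small function that composes three callables. This is a structural sketch only: `caption_3d_asset` and the `stub_*` functions are hypothetical names introduced here, and the stubs are toy stand-ins for the roles played by LLaVA, Vicuna, and GPT-3.5; no real model API is called.

```python
from typing import Callable, Sequence


def caption_3d_asset(
    views: Sequence[str],
    describe_view: Callable[[str], str],
    simplify: Callable[[str], str],
    aggregate: Callable[[Sequence[str]], str],
) -> str:
    """Produce one caption for a 3D asset from its multi-view renderings.

    describe_view: stands in for a vision-language model (raw caption per view).
    simplify:      stands in for a language model that shortens each raw caption.
    aggregate:     stands in for a language model that merges the per-view captions.
    """
    if not views:
        raise ValueError("at least one rendered view is required")
    raw_captions = [describe_view(view) for view in views]
    simple_captions = [simplify(caption) for caption in raw_captions]
    return aggregate(simple_captions)


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def stub_describe(view: str) -> str:
    return f"A detailed description of a red chair seen from the {view}."


def stub_simplify(caption: str) -> str:
    return caption.replace("A detailed description of ", "").rstrip(".")


def stub_aggregate(captions: Sequence[str]) -> str:
    # Deduplicate and join; a real aggregator would rewrite into one sentence.
    return "; ".join(sorted(set(captions)))
```

Passing the model calls in as parameters keeps the pipeline shape separate from any particular model or prompt, so each step can be swapped or tested on its own.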