A Survey On Text-to-3D Contents Generation In The Wild

Chenhan Jiang

TL;DR

Open-vocabulary text-to-3D generation aims to automate 3D asset creation from natural language prompts, but faces data scarcity, topology challenges, and substantial computational costs. The survey categorizes methods into feedforward generation, optimization-based generation, and view reconstruction, detailing 3D representations (explicit, implicit, hybrid) and diffusion priors that drive progress. It catalogs datasets (ShapeNet, Objaverse, Cap3D) and representative systems (Shap-E, DreamFusion, Instant3D), and surveys diffusion innovations including SDS, VSD, CDS, and view-aware finetuning that enable open-vocabulary 3D synthesis. The analysis highlights current limitations and outlines future directions, including higher-fidelity mesh generation, standardized evaluation benchmarks, and integrating large language models to improve text alignment and controllability, with a view toward democratizing 3D content creation.

Abstract

3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision-language models, these techniques aim to generate 3D content from textual descriptions. Despite recent advances in this area, existing solutions still face significant limitations in generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including the datasets employed in training and the evaluation metrics used to assess the quality of generated 3D models. We then examine the 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generation, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers to further explore the potential of open-vocabulary text-conditioned 3D content creation.

Paper Structure

This paper contains 22 sections, 13 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: This survey investigates text-to-3D content generation methods in the wild and categorizes them by algorithmic methodology. Feedforward generation directly outputs a 3D representation given text. Optimization-based generation optimizes a parametric 3D representation using gradients from a 2D diffusion model. View reconstruction follows a text-to-images-to-3D paradigm. Representative 3D generation results are obtained from Shap-E [jun2023shape], DreamFusion [dreamfusion22], and Instant3D [li2023instant3d].
  • Figure 2: The structure of CLIP [radford2021learning].
  • Figure 3: The process of DDPM [ho2020denoising].
  • Figure 4: Comparison of different representations with regard to rendering speed, memory usage with increasing resolution, shape deformation, data preprocessing time, and representation capacity for arbitrary geometry. A larger number of $\bigstar$ and a smaller number of indicate better performance.
  • Figure 5: Qualitative results of feedforward generation from 3D datasets. Compared with the DMTet and occupancy representations, the NeRF representation used in Shap-E [jun2023shape] tends to produce holes.
  • ...and 5 more figures