Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era

Chenghao Li, Chaoning Zhang, Joseph Cho, Atish Waghwase, Lik-Hang Lee, Francois Rameau, Yang Yang, Sung-Ho Bae, Choong Seon Hong

TL;DR

This survey maps the text-to-3D landscape in the AIGC era. It covers 3D data representations, foundational technologies (NeRF, diffusion, CLIP-based guidance), and methods ranging from early baselines to recent work that improve fidelity, efficiency, consistency, controllability, and diversity. It also reviews text-to-3D applications, including avatar, scene, and texture generation as well as 3D editing, and discusses evaluation standards, generalization, and social impact. The authors highlight the remaining challenges (fidelity-speed trade-offs, view consistency, and diversity) and outline a future agenda focused on robust evaluation, generalization to real-world scenarios, and responsible deployment. Overall, the work offers a structured roadmap for understanding, comparing, and advancing text-to-3D methods.

Abstract

Generative AI has made significant progress in recent years, with text-guided content generation being the most practical as it facilitates interaction between human instructions and AI-generated content (AIGC). Thanks to advancements in text-to-image and 3D modeling technologies, such as neural radiance fields (NeRF), text-to-3D has emerged as a nascent yet highly active research field. Our work conducts a comprehensive survey on this topic and follows up on subsequent research progress in the overall field, aiming to help readers interested in this direction quickly catch up with its rapid development. First, we introduce 3D data representations, including both structured and non-structured data. Building on this prerequisite, we introduce various core technologies to achieve satisfactory text-to-3D results. Additionally, we present mainstream baselines and research directions in recent text-to-3D technology, including fidelity, efficiency, consistency, controllability, diversity, and applicability. Furthermore, we summarize the usage of text-to-3D technology in various applications, including avatar generation, texture generation, scene generation, and 3D editing. Finally, we discuss the agenda for the future development of text-to-3D.

Paper Structure

This paper contains 33 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Five 3D representations: (1) mesh, (2) voxel, (3) point cloud, (4) multi-view, and (5) neural field. Representation capability: the capability of a method to capture and express the complexity and details of 3D objects. Computation efficiency: the computation required to generate, render, or process 3D representations. Memory efficiency: the amount of memory required to store 3D representations.
  • Figure 2: Timeline of foundation technologies that contribute to modern text-to-3D methods.
  • Figure 3: Five enhancement cases represent improvements in fidelity, diversity, consistency, efficiency, and controllability. Efficiency shows time reduction from 2 hours 9 minutes [poole2022dreamfusion] to 31 minutes [zhou2023dreampropeller]. Diversity demonstrates more varied sandcastle models, improving from [poole2022dreamfusion] to [yan2024flow]. Consistency shows better visual coherence in generating a kitten and toucan, evolving from [wang2022score] to [hong2023debiasing]. Fidelity highlights improved image quality, from [poole2022dreamfusion] to [zhu2023hifa]. Controllability displays finer control in generated crowns and birds, advancing from [poole2022dreamfusion, metzer2022latent, wang2022score] to [chen2023control3d].
  • Figure 4: Future agenda of text-to-3D development.
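
The memory-efficiency axis in Figure 1 can be made concrete with a back-of-the-envelope storage estimate. The Python sketch below is illustrative only: the function names, the float32/int32 storage layout, and the example sizes are assumptions made here, not values or code from the survey. It covers three of the five representations (voxel, point cloud, mesh), since multi-view and neural-field storage depends on image resolution and network size.

```python
# Rough storage estimates for three of the representations named in Figure 1.
# All names, layouts, and sizes are illustrative assumptions, not from the survey.

BYTES_PER_FLOAT32 = 4
BYTES_PER_INT32 = 4


def voxel_grid_bytes(resolution: int, channels: int = 1) -> int:
    """Dense grid: one float32 per channel per cell, so cost is cubic in resolution."""
    return resolution ** 3 * channels * BYTES_PER_FLOAT32


def point_cloud_bytes(num_points: int, with_color: bool = False) -> int:
    """Unordered points: xyz per point, plus optional rgb stored as float32."""
    floats_per_point = 6 if with_color else 3
    return num_points * floats_per_point * BYTES_PER_FLOAT32


def triangle_mesh_bytes(num_vertices: int, num_faces: int) -> int:
    """Indexed mesh: xyz per vertex plus three int32 vertex indices per face."""
    vertex_bytes = num_vertices * 3 * BYTES_PER_FLOAT32
    face_bytes = num_faces * 3 * BYTES_PER_INT32
    return vertex_bytes + face_bytes


if __name__ == "__main__":
    mib = 2 ** 20
    for res in (64, 128, 256):
        print(f"voxel grid {res}^3: {voxel_grid_bytes(res) / mib:.1f} MiB")
    print(f"point cloud, 100k points: {point_cloud_bytes(100_000) / mib:.2f} MiB")
    print(f"mesh, 50k vertices / 100k faces: "
          f"{triangle_mesh_bytes(50_000, 100_000) / mib:.2f} MiB")
```

Under these assumptions, doubling the voxel resolution multiplies storage by eight, whereas point-cloud and mesh storage grow linearly with the number of elements. This is one way to read the structured versus non-structured distinction drawn in the abstract.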