GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark

Xiao Cai, Sitong Su, Jingkuan Song, Pengpeng Zeng, Ji Zhang, Qinhong Du, Mengqi Li, Heng Tao Shen, Lianli Gao

TL;DR

GT23D-Bench addresses core bottlenecks in General Text-to-3D by delivering a large-scale, richly annotated 3D dataset and a geometry-aware evaluation framework. It combines 400K high-quality 3D assets with multimodal signals, 64-view renderings, and hierarchical captions, plus a 10-metric suite spanning text-3D alignment and 3D visual quality. The benchmark is validated against human judgments and used to analyze eight leading GT23D models, revealing gaps in fine-grained attribute alignment, multi-view consistency, and geometry fidelity. By providing publicly accessible data and metrics, GT23D-Bench sets a new standard for rigorous, reproducible GT23D research and evaluation.

Abstract

Text-to-3D (T23D) generation has emerged as a crucial visual generation task that aims to synthesize 3D content from textual descriptions. Research on this task is currently shifting from per-scene T23D, which requires optimizing the model for every piece of content generated, to General T23D (GT23D), in which a single pre-trained model generates different content without re-optimization, enabling more general and efficient 3D generation. Despite notable advances, GT23D is severely bottlenecked by two interconnected challenges: the lack of high-quality, large-scale training data and the prevalence of evaluation metrics that overlook intrinsic 3D properties. Existing datasets often suffer from incomplete annotations, noisy organization, and inconsistent quality, while current evaluations rely heavily on 2D image-text similarity or scoring and therefore fail to thoroughly assess 3D geometric integrity and semantic relevance. To address these fundamental gaps, we introduce GT23D-Bench, the first comprehensive benchmark specifically designed for GT23D training and evaluation. First, we construct a high-quality dataset of 400K 3D assets, featuring diverse visual annotations (70M+ visual samples) and multi-granularity hierarchical captions (1M+ descriptions) to foster robust semantic learning. Second, we propose a comprehensive evaluation suite with 10 metrics assessing both text-3D alignment and 3D visual quality at multiple levels. Crucially, we demonstrate through rigorous experiments that our proposed metrics correlate significantly better with human judgment than existing methods. Our in-depth analysis of eight leading GT23D models using this benchmark provides the community with critical insights into current model capabilities and their shared failure modes. GT23D-Bench will be publicly available to facilitate rigorous and reproducible research.
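
To make the claim about agreement with human judgment concrete, the following is a minimal sketch of how one might compare an automatic metric against human ratings using a rank correlation. The choice of Spearman's coefficient is an assumption (the abstract does not say which correlation measure is used), and the score lists in the usage note are hypothetical toy values, not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(vx * vy)


# Hypothetical toy data: one human rating and two metric scores per generated asset.
human_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
metric_a = [0.81, 0.40, 0.66, 0.35, 0.90]
metric_b = [0.70, 0.72, 0.69, 0.71, 0.68]

rho_a = spearman(human_scores, metric_a)
rho_b = spearman(human_scores, metric_b)
```

In this toy setup, a metric whose `rho` is closer to 1 orders the assets more like the human raters do; a real study would also need many more samples and a significance test.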

Paper Structure

This paper contains 21 sections, 10 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Illustration of GT23D-Bench. GT23D-Bench is the first benchmark for General Text-to-3D and consists of two components: 1) a 400K-asset 3D dataset that is multimodally annotated, organized by label, and thoroughly filtered (left), and 2) comprehensive 3D-aware evaluation metrics (right).
  • Figure 2: Evaluation Results of Current GT23D Models based on GT23D-Bench Metrics. We visualize the evaluation results of eight GT23D generation models across 10 dimensions. For comprehensive results, please refer to Tab. \ref{tab:rankings}.
  • Figure 3: Illustration of current GT23D dataset issues: (a) annotation issues, including overly generic or overly detailed text and insufficient multi-modal or multi-view visual annotations; (b) organizational issues, such as absent, misaligned, or overlapping category labels; (c) asset quality issues, including missing textures, fragmented geometry, and non-semantic or abstract 3D objects. These issues collectively highlight the urgent need for a more structured, semantically aligned, and high-quality 3D dataset.
  • Figure 4: Illustration of Dataset Annotation Pipeline. 1) Using PyVista, we render each object from 64 uniformly distributed viewpoints to obtain RGB images, depth maps, and normal maps. 2) We aggregate visual features from the front, top, and side views and use a large multimodal model to generate hierarchical textual annotations, ranging from coarse-grained category labels to fine-grained attribute-level descriptions. The detailed design of the hierarchical description prompts is provided in Suppl. I-C.
  • Figure 5: Illustration of caption granularity differences across datasets. Existing text–3D datasets typically provide captions of a single granularity—either overly simple or excessively detailed. In contrast, our hierarchical captions deliver more comprehensive and balanced object descriptions, integrating global semantics with fine-grained attribute details.
  • ...and 9 more figures
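
The Figure 4 caption mentions rendering each object from 64 uniformly distributed viewpoints. As an illustration only, the sketch below generates 64 near-uniform camera positions on a sphere around the object using a Fibonacci lattice. The lattice, the function name `fibonacci_sphere_viewpoints`, and the camera radius are assumptions made for this example; the source does not state how the viewpoints are actually sampled.

```python
import math


def fibonacci_sphere_viewpoints(n_views=64, radius=2.0):
    """Return n_views (x, y, z) camera positions spread near-uniformly on a sphere.

    Assumption: a Fibonacci lattice is one common way to get a roughly uniform
    spherical distribution; the benchmark's own sampling scheme may differ.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = []
    for i in range(n_views):
        # Equal steps in z give bands of equal area on the sphere.
        z = 1.0 - (2.0 * i + 1.0) / n_views
        ring_radius = math.sqrt(1.0 - z * z)
        # Rotating by the golden angle avoids points lining up in columns.
        theta = golden_angle * i
        positions.append((
            radius * ring_radius * math.cos(theta),
            radius * ring_radius * math.sin(theta),
            radius * z,
        ))
    return positions


viewpoints = fibonacci_sphere_viewpoints()
```

Each position would then serve as a camera location looking at the object's center (assumed here to sit at the origin). In PyVista, this typically means assigning a (position, focal point, view-up) triple to `Plotter.camera_position` before taking each screenshot.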