GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark

Xiao Cai; Sitong Su; Jingkuan Song; Pengpeng Zeng; Ji Zhang; Qinhong Du; Mengqi Li; Heng Tao Shen; Lianli Gao

TL;DR

GT23D-Bench addresses core bottlenecks in General Text-to-3D by delivering a large-scale, richly annotated 3D dataset and a geometry-aware evaluation framework. It combines 400K high-quality 3D assets with multimodal signals, 64-view renderings, and hierarchical captions, plus a 10-metric suite spanning text-3D alignment and 3D visual quality. The benchmark is validated against human judgments and used to analyze eight leading GT23D models, revealing gaps in fine-grained attribute alignment, multi-view consistency, and geometry fidelity. By providing publicly accessible data and metrics, GT23D-Bench sets a new standard for rigorous, reproducible GT23D research and evaluation.

Abstract

Text-to-3D (T23D) generation has emerged as a crucial visual generation task that aims to synthesize 3D content from textual descriptions. Research on this task is shifting from per-scene T23D, which requires optimizing a model separately for each generated asset, to General T23D (GT23D), in which a single pre-trained model generates diverse content without re-optimization, making 3D generation more general and efficient. Despite notable advances, GT23D is severely bottlenecked by two interconnected challenges: the lack of high-quality, large-scale training data and the prevalence of evaluation metrics that overlook intrinsic 3D properties. Existing datasets often suffer from incomplete annotations, noisy organization, and inconsistent quality, while current evaluations rely heavily on 2D image-text similarity or scoring and therefore fail to thoroughly assess 3D geometric integrity and semantic relevance. To address these gaps, we introduce GT23D-Bench, the first comprehensive benchmark specifically designed for GT23D training and evaluation. First, we construct a high-quality dataset of 400K 3D assets, featuring diverse visual annotations (70M+ visual samples) and multi-granularity hierarchical captions (1M+ descriptions) to foster robust semantic learning. Second, we propose a comprehensive evaluation suite of 10 metrics that assess both text-3D alignment and 3D visual quality at multiple levels. Crucially, we demonstrate through rigorous experiments that our proposed metrics correlate significantly more strongly with human judgment than existing methods do. Our in-depth analysis of eight leading GT23D models using this benchmark provides the community with critical insights into current model capabilities and their shared failure modes. GT23D-Bench will be publicly available to facilitate rigorous and reproducible research.
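The abstract's claim that a metric correlates with human judgment can be made concrete with a small sketch. The abstract does not say which correlation statistic is used, so the example below assumes Spearman rank correlation, a common choice for metric validation; the function names and the example scores are hypothetical and are not taken from GT23D-Bench.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical data: an automatic metric's scores and mean human ratings
# for the same four generated 3D assets (illustrative values only).
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1.0, 3.0, 2.0, 5.0]
rho = spearman(metric_scores, human_ratings)
```

Under this assumption, a value of `rho` near 1 means the metric orders the assets the same way human raters do, while a value near 0 means the metric's ordering carries little information about human preference.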
