Multi-Dimensional Quality Assessment for Text-to-3D Assets: Dataset and Model
Kang Fu, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min, Jia Wang, Guangtao Zhai
TL;DR
This work addresses the lack of objective, multi-perspective quality assessment for text-to-3D assets by introducing the AIGC-T23DAQA database (969 validated assets generated from 170 prompts by 6 generation models) and a projection-based T23DAQA framework. The proposed method decomposes perception into shape, texture, and text-asset correspondence using projection videos, then fuses these features through an MLP to predict three MOS-aligned scores: quality, authenticity, and correspondence. It outperforms a wide range of NR-IQA and VQA baselines, and ablations show that the text-image alignment module is critical and that the multi-branch, multimodal design adds value. The dataset and methodology support benchmarking and optimization of text-to-3D asset generation, with the aim of improving the perceptual quality and semantic fidelity of AI-generated 3D content.
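The fusion step described above can be illustrated with a minimal sketch. It assumes the three branches have already produced fixed-length feature vectors for one asset; the feature sizes, the single hidden layer, the ReLU activation, and all function names are hypothetical choices made for illustration, not details taken from the paper, and the weights are random rather than trained.

```python
import random

random.seed(0)

def init_layer(in_dim, out_dim):
    """Small random weights and zero biases for one dense layer (untrained)."""
    weights = [[random.gauss(0.0, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def dense(x, layer):
    """Apply y = W x + b to a single feature vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fuse_and_score(shape_feat, texture_feat, corr_feat, hidden_layer, output_layer):
    """Concatenate the three branch features and regress three scores."""
    fused = shape_feat + texture_feat + corr_feat  # list concatenation
    hidden = [max(h, 0.0) for h in dense(fused, hidden_layer)]  # ReLU
    quality, authenticity, correspondence = dense(hidden, output_layer)
    return {
        "quality": quality,
        "authenticity": authenticity,
        "correspondence": correspondence,
    }

# Hypothetical feature sizes for one asset (assumptions, not from the paper).
shape_feat = [random.random() for _ in range(4)]
texture_feat = [random.random() for _ in range(6)]
corr_feat = [random.random() for _ in range(2)]

hidden_layer = init_layer(4 + 6 + 2, 8)
output_layer = init_layer(8, 3)

scores = fuse_and_score(shape_feat, texture_feat, corr_feat,
                        hidden_layer, output_layer)
print(scores)
```

In a trained model the three outputs would be regressed against the corresponding mean opinion scores; here they only show how one fused representation can yield a separate prediction per perspective.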
Abstract
Recent advancements in text-to-image (T2I) generation have spurred the development of text-to-3D asset (T23DA) generation, which leverages pretrained 2D text-to-image diffusion models for 3D asset synthesis. Despite the growing popularity of text-to-3D asset generation, its evaluation has not been well studied. Given the significant quality discrepancies among text-to-3D assets, there is a pressing need for quality assessment models aligned with human subjective judgments. To tackle this challenge, we conduct a comprehensive study of the T23DA quality assessment (T23DAQA) problem from both subjective and objective perspectives. Given the absence of corresponding databases, we first establish the largest text-to-3D asset quality assessment database to date, termed the AIGC-T23DAQA database. This database comprises 969 validated 3D assets generated from 170 prompts via 6 popular text-to-3D asset generation models, together with subjective ratings of these assets from three perspectives: quality, authenticity, and text-asset correspondence. Subsequently, we establish a comprehensive benchmark based on the AIGC-T23DAQA database and devise an effective T23DAQA model to evaluate generated 3D assets from each of these three perspectives.
