Multi-Dimensional Quality Assessment for Text-to-3D Assets: Dataset and Model

Kang Fu, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min, Jia Wang, Guangtao Zhai

TL;DR

This work tackles the lack of objective, multi-perspective quality assessment for text-to-3D assets by introducing the AIGC-T23DAQA database (969 validated assets across 170 prompts and 6 generation models) and a projection-based T23DAQA framework. The proposed method decomposes perception into shape, texture, and text-asset correspondence via projection videos and fuses these features through an MLP to predict three MOS-aligned scores: quality, authenticity, and correspondence. It demonstrates superior performance over a wide range of NR/IQA and VQA baselines, with ablations showing the critical role of the text-image alignment module and the value of a multi-branch, multimodal approach. The dataset and methodology enable robust benchmarking and optimization of text-to-3D asset generation, supporting improved perceptual quality and semantic fidelity in AI-generated 3D content.
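The fusion step described above can be illustrated with a small sketch: three per-branch feature vectors (shape, texture, text-asset correspondence) are concatenated and passed through an MLP with three outputs. This is not the authors' implementation. The feature sizes, the single hidden layer, the ReLU activation, the random untrained weights, and all function names are assumptions chosen only to show the data flow.

```python
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu=False):
    """Apply a dense layer to vector x, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def fuse_and_score(shape_feat, texture_feat, corr_feat, hidden, head):
    """Concatenate the three branch features and map them to three scores."""
    fused = shape_feat + texture_feat + corr_feat  # list concatenation
    h = dense(fused, hidden, relu=True)
    quality, authenticity, correspondence = dense(h, head)
    return {
        "quality": quality,
        "authenticity": authenticity,
        "correspondence": correspondence,
    }


rng = random.Random(0)
# Hypothetical 8-dimensional features standing in for each branch's output.
shape_feat = [rng.random() for _ in range(8)]
texture_feat = [rng.random() for _ in range(8)]
corr_feat = [rng.random() for _ in range(8)]

hidden = make_layer(24, 16, rng)  # 24 = 3 branches x 8 features (assumed sizes)
head = make_layer(16, 3, rng)     # one output per predicted score

scores = fuse_and_score(shape_feat, texture_feat, corr_feat, hidden, head)
print(scores)
```

With untrained weights the printed values carry no meaning. In a trained model, the weights would be fitted so that each output tracks the corresponding MOS.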

Abstract

Recent advancements in text-to-image (T2I) generation have spurred the development of text-to-3D asset (T23DA) generation, which leverages pretrained 2D text-to-image diffusion models for 3D asset synthesis. Despite the growing popularity of text-to-3D asset generation, its evaluation has not been well studied. Given the significant quality discrepancies among text-to-3D assets, there is a pressing need for quality assessment models aligned with human subjective judgments. To tackle this challenge, we conduct a comprehensive study of the T23DA quality assessment (T23DAQA) problem from both subjective and objective perspectives. Given the absence of corresponding databases, we first establish the largest text-to-3D asset quality assessment database to date, termed the AIGC-T23DAQA database. This database encompasses 969 validated 3D assets generated from 170 prompts via 6 popular text-to-3D asset generation models, along with subjective ratings of these assets from three perspectives: quality, authenticity, and text-asset correspondence. Subsequently, we establish a comprehensive benchmark based on the AIGC-T23DAQA database, and devise an effective T23DAQA model to evaluate the generated 3D assets from these three perspectives.
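One common way to quantify how well a model's predictions align with human subjective judgments is a rank correlation between predicted scores and MOS. The sketch below computes a Spearman rank correlation in plain Python. The abstract does not state which metrics the benchmark reports, so the choice of Spearman correlation is an assumption, and the score values are hypothetical illustration data, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical MOS and model predictions for five assets.
mos = [3.1, 4.5, 2.0, 3.8, 1.2]
predicted = [2.9, 4.1, 2.4, 4.0, 1.0]

srcc = spearman(predicted, mos)
print(f"Spearman correlation: {srcc:.3f}")
```

In this toy example the predictions preserve the ordering of the MOS values exactly, so the coefficient is 1 up to floating-point rounding. The same function would be applied separately to each of the three perspectives.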

Paper Structure

This paper contains 23 sections, 12 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Illustration of the difference between a traditional 3D asset and an AI-generated 3D asset, whose perceptual quality is affected by different attributes.
  • Figure 2: An overview of the established AIGC-T23DAQA database and the proposed T23DAQA method. The AIGC-T23DAQA database is the first and the largest text-to-3D asset quality assessment database. It encompasses 969 validated 3D assets generated from 170 prompts via 6 popular text-to-3D asset generation models, and corresponding subjective quality ratings. In addition, we propose a T23DAQA method to predict text-to-3D asset quality from three aspects: shape, texture, and correspondence. The proposed method achieves state-of-the-art performance in evaluating the perceptual attributes of text-to-3D assets.
  • Figure 3: Pie chart of the prompts used, which cover 11 challenge categories and 12 scene categories.
  • Figure 4: Sample 3D assets from the AIGC-T23DAQA database, generated from the same input prompt by Dreamfusion poole2022dreamfusion, LatentNerf metzer2023latent, Magic3D lin2023magic3d, Prolificdreamer wang2024prolificdreamer, SJC wang2023score, and TextMesh tsalicoglou2023textmesh, respectively. (a) 3D assets generated from the prompt "a harp without any strings". (b) 3D assets generated from the prompt "a pair of brown suede shoes". The visual quality of assets generated by different models clearly varies greatly.
  • Figure 5: Illustration of the differences between the three dimensions of quality, authenticity, and text-3D correspondence. In each subfigure, the images in the top row are significantly better than those in the bottom row in two perspectives, while similar or worse in the remaining perspective. (a) and (b) show examples in which the authenticity and correspondence scores of the top images are higher, while the quality is similar. (c) and (d) show examples in which the quality and correspondence scores of the top images are higher, while the authenticity is similar or lower. (e) and (f) show examples in which the quality and authenticity scores of the top images are higher, while the correspondence is similar or lower.
  • ...and 6 more figures