Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric

Bingyang Cui, Yujie Zhang, Qi Yang, Zhu Li, Yiling Xu

TL;DR

This paper targets fine-grained Text-to-3D quality assessment by building a large-scale, compositional benchmark (T23D-CompBench) and a two-stage evaluator (Rank2Score) that aligns closely with human judgments. It designs 360 prompts across 30 compositional combinations, generates 3,600 textured meshes with ten models, and collects 129,600 MOS-based annotations across twelve quality dimensions, enabling robust training of a cross-modal quality metric. Rank2Score uses CLIP-based visual/textual features, learnable dimension prompts, and a curriculum-driven training regime to learn relative rankings first and then refine them into absolute MOS-aligned scores. Across four benchmarks, Rank2Score consistently outperforms existing metrics, demonstrates cross-benchmark generalization, and can serve as both an evaluation tool and a reward signal to guide T23D model training.

Abstract

Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at https://cbysjtu.github.io/Rank2Score/.
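
The two-stage idea described in the abstract (learn relative rankings first, then align with mean opinion scores) can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain logistic pairwise loss for the supervised contrastive regression objective, and a simple MOS-gap threshold for the curriculum. All function names and parameters (pairwise_rank_loss, curriculum_pairs, mos_regression_loss, start_gap, end_gap) are hypothetical and chosen only for illustration.

```python
import math


def pairwise_rank_loss(score_a, score_b, mos_a, mos_b):
    """Stage 1 (illustrative): logistic pairwise loss.

    The loss is small when the predicted order of (a, b) matches the
    order given by their MOS values, and large when it is reversed.
    """
    if mos_a == mos_b:
        return 0.0  # assumption: tied pairs carry no ranking signal here
    sign = 1.0 if mos_a > mos_b else -1.0
    margin = sign * (score_a - score_b)
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def curriculum_pairs(mos, start_gap, end_gap, epoch, num_epochs):
    """Illustrative curriculum: select pairs by MOS gap.

    Early epochs keep only pairs with a large MOS gap (easy to rank);
    the minimum gap moves linearly from start_gap to end_gap.
    """
    frac = epoch / max(num_epochs - 1, 1)
    min_gap = start_gap + frac * (end_gap - start_gap)
    n = len(mos)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(mos[i] - mos[j]) >= min_gap
    ]


def mos_regression_loss(scores, mos):
    """Stage 2 (illustrative): mean squared error against MOS."""
    return sum((s - m) ** 2 for s, m in zip(scores, mos)) / len(mos)


if __name__ == "__main__":
    mos = [1.0, 2.0, 5.0]  # toy MOS values, not taken from the benchmark
    for epoch in range(3):
        print(epoch, curriculum_pairs(mos, 3.0, 0.5, epoch, 3))
```

In this sketch the first stage only needs the relative order of two samples, so it can use every admissible pair; the second stage then calibrates the predicted scores to the MOS scale.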

Paper Structure

This paper contains 45 sections, 19 equations, 20 figures, 15 tables.

Figures (20)

  • Figure 1: Illustration of common distortions occurring in generated meshes.
  • Figure 2: Difficulty levels of pairwise ranking across prompts, scores, and dimensions.
  • Figure 3: Illustration of the benchmark construction pipeline.
  • Figure 4: Samples generated from different models and component combinations.
  • Figure 5: Samples with excellent and poor quality in each quality dimension.
  • ...and 15 more figures