Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects

Shalini Maiti, Lourdes Agapito, Filippos Kokkinos

TL;DR

Gen3DEval addresses the lack of human-aligned, scalable evaluation for text-to-3D generation by training a vision-language model to judge appearance, surface quality, and text fidelity from multi-view renderings. The approach combines a fine-tuned Llama3-based vLLM with a learnable image-to-text projection and curated datasets (artist meshes, human preferences, and synthetic perturbations). The trained model makes pairwise comparisons between generated objects, and these comparisons feed an Elo ranking on Gen3DEval-Bench. Key contributions include the vLLM-based holistic metric, a public benchmark with 80 prompts, and ablations showing strong alignment with human judgments across the evaluation axes. The framework offers a practical standard for comparing 3D generation methods without ground-truth data.

Abstract

Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality by analyzing 3D surface normals, without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, placing it as a comprehensive and accessible benchmark for future research on text-to-3D generation. The project page can be found here: https://shalini-maiti.github.io/gen3deval.github.io/.

Paper Structure

This paper contains 30 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Gen3DEval framework: In stage 1, we train a vLLM to choose which object is better in terms of appearance, surface quality or text fidelity. This is further divided into 2 parts. In pre-training, we train the vision-to-language projector using image summary VQA. In the supervised fine-tuning (SFT) stage, we use comparison data to train for instruction following and preference evaluation. In stage 2, we compute a ranking metric for the set of methods by applying the trained vLLM from stage 1 pairwise on Gen3DEval-Bench prompts.
  • Figure 2: Training Dataset: We use single- and multi-view RGB and surface normal renderings of a 3D object generated from a prompt. We then perturb these objects to simulate common appearance, surface, and text-related artefacts in generative 3D methods.
  • Figure 3: Qualitative Comparison of methods on samples of the evaluation dataset across text fidelity, appearance and surface evaluation.
  • Figure 4: Pre-training Dataset: We use multiple views of RGB and surface normal maps rendered from a 3D object, accompanied by a Question-Answer prompt that summarizes the object.
  • Figure 5: Pre-training Dataset: We use single and multiple views rendered from a 3D object, as well as an image grid composed of the four multi-view RGB images.
  • ...and 11 more figures
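
The stage 2 ranking described in the Figure 1 caption (pairwise verdicts from the trained vLLM turned into a ranking over methods) could be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the `judge` callable is a hypothetical stand-in for the trained vLLM, and the K-factor of 32, the starting rating of 1000, and the all-pairs-per-prompt schedule are conventional Elo defaults chosen here, not values taken from the paper.

```python
from itertools import combinations


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that the method rated r_a beats the one rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_rank(methods, prompts, judge, k=32.0, initial=1000.0):
    """Rate methods from pairwise verdicts.

    `judge(prompt, a, b)` is a hypothetical callable standing in for the
    trained vLLM; it must return either `a` or `b` (the preferred method).
    `k` and `initial` are assumed defaults, not values from the paper.
    """
    ratings = {m: initial for m in methods}
    for prompt in prompts:
        # Compare every unordered pair of methods on this prompt.
        for a, b in combinations(methods, 2):
            winner = judge(prompt, a, b)
            score_a = 1.0 if winner == a else 0.0
            exp_a = expected_score(ratings[a], ratings[b])
            # Zero-sum update: whatever A gains, B loses.
            ratings[a] += k * (score_a - exp_a)
            ratings[b] -= k * (score_a - exp_a)
    return ratings


# Toy usage: a fake judge that prefers methods by a fixed, made-up quality.
toy_quality = {"method_a": 3, "method_b": 2, "method_c": 1}


def toy_judge(prompt, a, b):
    return a if toy_quality[a] >= toy_quality[b] else b


toy_ratings = elo_rank(list(toy_quality), ["a chair", "a teapot"], toy_judge)
ranking = sorted(toy_ratings, key=toy_ratings.get, reverse=True)
```

Because each update is zero-sum, the ratings always total `initial` times the number of methods. Sequential Elo updates depend on the order in which comparisons are processed, so a real evaluation would likely shuffle or average over several orderings.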