Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

Shivam Duggal, Yushi Hu, Oscar Michel, Aniruddha Kembhavi, William T. Freeman, Noah A. Smith, Ranjay Krishna, Antonio Torralba, Ali Farhadi, Wei-Chiu Ma

TL;DR

Eval3D proposes an interpretable, fine-grained evaluation framework for text- and image-driven 3D generation that uses a diverse set of foundation models as probes. It defines five complementary metrics (Geometric, Semantic, Structural, Text-3D Alignment, and Aesthetic) plus 3D artifact localization, so inconsistencies can be localized both at the pixel level and in 3D space. The framework is validated on a curated Eval3D Benchmark with dense human annotations and shows stronger alignment with human judgments than prior open- or closed-source metrics. Results reveal that many top-performing 3D generators still suffer from geometric or semantic inconsistencies, and that image guidance can improve semantic and structural coherence. Eval3D is open-source and modular, promoting reliable evaluation and potential feedback-driven improvements in 3D generation systems.

Abstract

Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often overlook the geometric quality of generated assets or merely rely on black-box multimodal large language models for coarse assessment. In this paper, we introduce Eval3D, a fine-grained, interpretable evaluation tool that can faithfully evaluate the quality of generated 3D assets based on various distinct yet complementary criteria. Our key observation is that many desired properties of 3D generation, such as semantic and geometric consistency, can be effectively captured by measuring the consistency among various foundation models and tools. We thus leverage a diverse set of models and tools as probes to evaluate the inconsistency of generated 3D assets across different aspects. Compared to prior work, Eval3D provides pixel-wise measurement, enables accurate 3D spatial feedback, and aligns more closely with human judgments. We comprehensively evaluate existing 3D generation models using Eval3D and highlight the limitations and challenges of current models.

Paper Structure

This paper contains 52 sections, 4 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Challenges of 3D generation: (1) Structural inconsistency: lack of globally-coherent 3D shape; (2) Text-3D misalignment: failure to meet the requirements of the input text-prompt; (3) Semantic inconsistency: content change and incoherent semantics; (4) Geometric inconsistency: misaligned geometry and texture.
  • Figure 2: Eval3D offers interpretable, fine-grained, and human-aligned metrics to assess the quality of 3D generations from various aspects. We utilize a diverse array of foundation models and tools to measure the consistency among different representations of generated 3D assets.
  • Figure 3: Geometric inconsistency evaluates texture-geometry misalignment by comparing normals rendered from the 3D asset with image-based normals. Bright yellow indicates a large discrepancy.
  • Figure 4: Structural consistency evaluates the geometric coherence of the generated 3D assets by comparing rendered views with the predictions from a novel view synthesis model (Zero-123) across various rotations. We utilize DreamSim to assess image similarity.
  • Figure 5: 3D inconsistency maps: The proposed 3D metrics, semantic and geometric consistency, allow fine-grained localization of artifacts (e.g., the Janus issue: multiple noses / faces, inconsistent hand geometry, arbitrary surface patterns on the back) by fusing / computing the metrics in 3D space.
  • ...and 16 more figures
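
The geometric comparison described in the Figure 3 caption (rendered normals versus image-based normals) can be illustrated with a small sketch. This is not the paper's implementation: the per-pixel angular error, the 30-degree threshold, and the function names `angular_discrepancy_map` and `inconsistent_fraction` are all assumptions chosen for illustration, and the two normal maps are taken as given rather than produced by any particular renderer or estimator.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v) if norm > 0 else tuple(v)


def angular_discrepancy_map(rendered_normals, predicted_normals):
    """Per-pixel angle in degrees between two H x W grids of 3D normals.

    Assumption: angular error is used as the discrepancy measure.
    """
    result = []
    for row_r, row_p in zip(rendered_normals, predicted_normals):
        out_row = []
        for n_r, n_p in zip(row_r, row_p):
            a, b = normalize(n_r), normalize(n_p)
            cos = sum(x * y for x, y in zip(a, b))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding drift
            out_row.append(math.degrees(math.acos(cos)))
        result.append(out_row)
    return result


def inconsistent_fraction(discrepancy_map, threshold_deg=30.0):
    """Fraction of pixels whose angular error exceeds a (hypothetical) threshold."""
    pixels = [angle for row in discrepancy_map for angle in row]
    return sum(angle > threshold_deg for angle in pixels) / len(pixels)


# Toy 1 x 2 "image": the first pixel agrees, the second is off by 90 degrees.
rendered = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
predicted = [[(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]]
discrepancy = angular_discrepancy_map(rendered, predicted)
score = inconsistent_fraction(discrepancy)
```

The discrepancy map plays the role of the per-pixel heat map in the caption (large values would be drawn bright yellow), while the scalar fraction is one possible way to reduce it to a single score per view.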