Explaining Human Preferences via Metrics for Structured 3D Reconstruction

Jack Langerman, Denys Rozumnyi, Yuzhong Huang, Dmytro Mishkin

TL;DR

The paper asks how structured 3D wireframe reconstructions should be evaluated, and answers by systematically comparing a broad set of metrics against human expert judgments. It surveys traditional and novel metrics (including Wireframe Edit Distance, Edge Chamfer Distance, Hausdorff/IoU-based measures, and a Length Weighted Spectral Distance) and introduces a learned metric distilled from human judgments via DINOv2 features with a Bradley–Terry supervision signal. A large-scale human ranking study, reliability analyses, and multiple aggregation approaches establish which metrics best align with expert preferences: edge- and corner-based metrics correlate more closely with human judgments than complex graph-based or spectral metrics. The work also develops unit tests for metric properties, gives use-case recommendations, and demonstrates a promising learned metric while cautioning about reward hacking. It suggests that benchmarks blend multiple robust metrics for reliable evaluation.
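The Bradley–Terry supervision mentioned above can be illustrated with a minimal sketch. It assumes each wireframe receives a scalar quality score (for instance, the output of a learned head on image features) and that an annotator's pairwise preference is modeled as a logistic function of the score difference. The function names and the scalar-score setup are illustrative assumptions, not the paper's actual training code.

```python
import math


def bt_win_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B.

    Scores are log-strengths, so the probability is a sigmoid of their
    difference. (Hypothetical helper, not taken from the paper's code.)
    """
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def bt_pairwise_loss(score_winner: float, score_loser: float) -> float:
    """Negative log-likelihood of one observed preference.

    Equals -log(sigmoid(d)) with d = score_winner - score_loser,
    computed in a numerically stable form.
    """
    d = score_winner - score_loser
    return math.log1p(math.exp(-abs(d))) + max(-d, 0.0)


if __name__ == "__main__":
    # Equal scores give a coin flip; a higher score for the preferred
    # item lowers the loss.
    print(bt_win_probability(0.0, 0.0))
    print(bt_pairwise_loss(2.0, 0.0), bt_pairwise_loss(0.0, 2.0))
```

Minimizing this loss over many annotated pairs pushes the scores of preferred wireframes above those of rejected ones, which is one common way to distill rankings into a scalar metric.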

Abstract

"What cannot be measured cannot be improved," while likely never uttered by Lord Kelvin, effectively summarizes the driving force behind this work. This paper presents a detailed discussion of automated metrics for evaluating structured 3D reconstructions. Pitfalls of each metric are discussed, and an analysis through the lens of expert 3D modelers' preferences is presented. A set of systematic "unit tests" is proposed to empirically verify desirable properties, and context-aware recommendations are provided on which metric to use for a given application. Finally, a learned metric distilled from human expert judgments is proposed and analyzed. The source code is available at https://github.com/s23dr/wireframe-metrics-iccv2025

Paper Structure

This paper contains 10 sections, 9 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: A motivating example for this work. While humans tend to sort the wireframes from best to worst in the presented order, popular metrics (defined in the metrics section) sort them differently, sometimes completely inverting the order. Top, left to right: GT: ground truth wireframe; WF1: wireframe with edges split into several segments, maintaining geometrical and topological accuracy; WF2: wireframe with missing vertices and edges; WF3: wireframe with only one correct vertex. Bottom: distances between GT and the respective wireframe. Numbers that change the ordering are shown in red.
  • Figure 2: Wireframe ranking interface for human annotators.
  • Figure 3: Examples of corrupted ground truth wireframes used for wireframe ranking. Left to right: GT, deformed edges (deform_medium), vertex duplication and random movement (perturb_medium), edge addition (add_low), edge deletion (remove_low).
  • Figure 4: Annotator agreement (all pairs). Left to right: annotator agreement with each other, the learned metric, handcrafted metrics, and VLMs. Annotator backgrounds: A-K: 3D modellers; Des[0-2]: designers; CV[0-3]: computer vision engineers. Best viewed zoomed in.
  • Figure 5: Probability of selecting the wrong "winner" depending on the number of raters (left), individual accuracy (center), and win rate (right).
  • ...and 7 more figures
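
The relationship summarized in the Figure 5 caption, between the number of raters, individual accuracy, and the chance of picking the wrong "winner", can be illustrated with a simple majority-vote model. The sketch below assumes an odd number of independent raters who all share the same accuracy; this is a textbook binomial simplification chosen for illustration, not necessarily the analysis used in the paper, and the function name is hypothetical.

```python
from math import comb


def majority_error(n_raters: int, accuracy: float) -> float:
    """Probability that a majority vote selects the wrong item.

    Assumes n_raters is odd (no ties) and that each rater independently
    picks the correct item with probability `accuracy`.
    """
    if n_raters < 1 or n_raters % 2 == 0:
        raise ValueError("n_raters must be a positive odd integer")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    wrong = 1.0 - accuracy
    votes_needed = n_raters // 2 + 1  # wrong votes required for a wrong majority
    return sum(
        comb(n_raters, k) * wrong**k * accuracy ** (n_raters - k)
        for k in range(votes_needed, n_raters + 1)
    )


if __name__ == "__main__":
    # Under this model, adding raters reduces the error whenever
    # individual accuracy exceeds 0.5.
    for n in (1, 3, 5, 7):
        print(n, majority_error(n, 0.8))
```

Under these assumptions the error shrinks as raters are added only when individual accuracy exceeds one half; correlated annotators would weaken that effect.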