Explaining Human Preferences via Metrics for Structured 3D Reconstruction
Jack Langerman, Denys Rozumnyi, Yuzhong Huang, Dmytro Mishkin
TL;DR
The paper addresses how to evaluate structured 3D wireframe reconstructions by systematically comparing a broad set of metrics against human expert judgments. It surveys traditional and novel metrics (including Wireframe Edit Distance, Edge Chamfer Distance, Hausdorff- and IoU-based measures, and a Length Weighted Spectral Distance) and introduces a learned metric distilled from human judgments, built on DINOv2 features and trained with a Bradley–Terry supervision signal. A large-scale human ranking study, reliability analyses, and multiple aggregation approaches establish which metrics best align with expert preferences; edge- and corner-based metrics correlate more closely with human judgments than complex graph-based or spectral metrics. The work also develops unit tests for metric properties and gives practical recommendations for different use cases. The learned metric is promising, but the authors caution about reward hacking and recommend that benchmarks combine multiple robust metrics for reliable evaluation.
Abstract
"What cannot be measured cannot be improved" while likely never uttered by Lord Kelvin, summarizes effectively the driving force behind this work. This paper presents a detailed discussion of automated metrics for evaluating structured 3D reconstructions. Pitfalls of each metric are discussed, and an analysis through the lens of expert 3D modelers' preferences is presented. A set of systematic "unit tests" are proposed to empirically verify desirable properties, and context aware recommendations regarding which metric to use depending on application are provided. Finally, a learned metric distilled from human expert judgments is proposed and analyzed. The source code is available at https://github.com/s23dr/wireframe-metrics-iccv2025

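As a rough illustration of the Bradley–Terry supervision signal mentioned in the TL;DR, the sketch below fits one scalar score per reconstruction from pairwise "A was preferred over B" judgments. It is not the authors' implementation: the paper's learned metric predicts scores from DINOv2 features, whereas here each item simply gets a free parameter. The function names (`bradley_terry_prob`, `pairwise_nll`, `fit_scores`), the toy comparison data, and the learning rate and step count are all assumptions made for this example.

```python
import math

def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that item a is preferred over item b under Bradley-Terry,
    with scores interpreted as log-strengths."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_nll(scores, comparisons):
    """Mean negative log-likelihood of the observed (winner, loser) pairs."""
    total = 0.0
    for winner, loser in comparisons:
        total -= math.log(bradley_terry_prob(scores[winner], scores[loser]))
    return total / len(comparisons)

def fit_scores(n_items, comparisons, lr=0.1, steps=500):
    """Fit one score per item by plain gradient descent on the mean NLL.
    (Hypothetical stand-in for training a feature-based scoring network.)"""
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p = bradley_terry_prob(scores[winner], scores[loser])
            # d/ds_winner of -log(p) is -(1 - p); d/ds_loser is +(1 - p).
            grads[winner] -= 1.0 - p
            grads[loser] += 1.0 - p
        scores = [s - lr * g / len(comparisons) for s, g in zip(scores, grads)]
    return scores

# Toy data (made up for illustration): each pair is (preferred, rejected).
comparisons = [(0, 1), (0, 1), (1, 0), (0, 2), (1, 2), (1, 2), (2, 1)]
scores = fit_scores(3, comparisons)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print("fitted scores:", scores)
print("ranking (best first):", ranking)
```

Only differences between scores matter in this model, so the fitted values are meaningful up to a shared additive constant. The resulting ranking is what would be compared against a candidate metric's ranking.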