OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations

Christina Kassab, Sacha Morin, Martin Büchner, Matías Mattamala, Kumaraditya Gupta, Abhinav Valada, Liam Paull, Maurice Fallon

TL;DR

OpenLex3D addresses the gap in evaluating open-vocabulary 3D scene representations by introducing a four-category label taxonomy (synonyms, depictions, visually similar, clutter) and relabeling scenes from Replica, ScanNet++, and HM3D, covering 3812 objects with up to 13x more labels per scene. It defines two evaluation tasks, tiered open-set semantic segmentation and open-set object retrieval, with large per-dataset prompt lists and per-scene retrieval queries to probe language-grounded 3D perception. Two metrics, Top-N Frequency and Set Ranking, quantify per-point category accuracy and the distribution of predictions across precision tiers, revealing distinct failure modes across methods. Experiments show that no single method excels across both tasks, highlighting the need for improved feature fusion and segmentation strategies. The benchmark is publicly available.

Abstract

3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is currently limited to datasets with closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark for evaluating 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymous object categories and additional nuanced descriptions. Our label sets provide 13 times more labels per scene than the original datasets. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases and avenues for improvement. Our experiments provide insights on feature precision, segmentation, and downstream capabilities. The benchmark is publicly available at: https://openlex3d.github.io/.

Paper Structure

This paper contains 35 sections, 7 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: The OpenLex3D evaluation benchmark enables more detailed analysis of open-vocabulary 3D scene representations than closed-vocabulary evaluation methods. We compare the same open-vocabulary representation when assessed under closed-vocabulary semantics (left) and using OpenLex3D labels (right). In contrast to closed-vocabulary methods where a prediction must match the exact ground truth label, OpenLex3D provides a manifold of label categories of varying precision: synonyms being the most precise; depictions, which include, e.g., printed images on objects; visually similar, which refer to objects with comparable appearance; and clutter, which accounts for label perturbation due to imprecise segmentation.
  • Figure 2: OpenLex3D label example on ScanNet++ [yeshwanthliu2023scannetpp]. We provide not only synonyms for the object but also labels for various potential failure cases, including depictions visible on the target object (e.g., flower prints), visually similar objects (e.g., blanket), and surrounding clutter (indicated by the IDs of the neighboring objects).
  • Figure 3: Top-N Frequency and Set Ranking metrics illustration. (a) Top-N Freq. measures whether any of the top-N responses contains a label from category C. (b) Set Ranking evaluates the ranking of responses, assessing how closely the predicted rankings align with ideal rankings of categories.
  • Figure 4: Top-5 Freq. results for category classification for OpenMask3D [takmaz2023openmask3d], ConceptGraphs [ConceptGraphs2023], and ConceptFusion [conceptfusion], colored by category class. Object-centric methods that segment in 3D, like OpenMask3D (top), often miss points due to generalization or depth quality issues. Those merging 2D segments tend to merge smaller ones, leading to misclassifications (middle). Dense representations, such as ConceptFusion, produce noisier predictions due to point-level features aggregating information from various context scales. In the highly cluttered environments of ScanNet++ [yeshwanthliu2023scannetpp], all evaluated methods show reduced performance.
  • Figure 5: Example predictions and categories. We show a correctly predicted label (top). The depictions category handles cases in which the image depicted on an object is mistaken for the object itself (middle). The visually similar category handles reasonable but imprecise predictions (bottom).
  • ...and 10 more figures
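
The Top-N Frequency idea described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the benchmark's implementation. The function names, the tier precedence (a point is assigned to the most precise tier it matches), the "incorrect" fallback bucket, and the example labels (a hypothetical pillow object) are all assumptions made for illustration.

```python
from collections import Counter

# Precision tiers, most to least precise, following the Figure 1 caption.
# Treating this order as a precedence rule is an assumption of this sketch.
TIERS = ("synonyms", "depictions", "visually_similar", "clutter")


def classify_point(top_n_labels, label_sets):
    """Return the first tier whose label set intersects the top-N predictions."""
    predicted = set(top_n_labels)
    for tier in TIERS:
        if predicted & set(label_sets.get(tier, ())):
            return tier
    return "incorrect"  # assumed fallback when no tier matches


def top_n_frequency(predictions, ground_truth, n=5):
    """Fraction of points that fall into each tier.

    predictions: per-point list of labels ranked from best to worst.
    ground_truth: per-point dict mapping tier name -> acceptable labels.
    """
    counts = Counter(
        classify_point(ranked[:n], labels)
        for ranked, labels in zip(predictions, ground_truth)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: counts[tier] / total for tier in (*TIERS, "incorrect")}


# Hypothetical labels for one object, loosely modeled on the Figure 2 caption.
pillow_labels = {
    "synonyms": {"pillow", "cushion"},
    "depictions": {"flower"},
    "visually_similar": {"blanket"},
    "clutter": {"bed"},
}
example_predictions = [
    ["cushion", "sofa"],    # matches a synonym
    ["flower", "leaf"],     # matches a depiction
    ["blanket", "towel"],   # matches a visually similar label
    ["lamp", "door"],       # matches nothing
]
example_freq = top_n_frequency(example_predictions, [pillow_labels] * 4, n=2)
print(example_freq)
```

Reporting one frequency per tier, rather than a single accuracy number, is what lets the failure modes mentioned in the captions (depiction confusion, visually similar predictions, clutter) be told apart.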