OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

Youjun Zhao; Jiaying Lin; Shuquan Ye; Qianshi Pang; Rynson W. H. Lau

TL;DR

This work defines Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D), which extends open-vocabulary 3D scene understanding (OV-3D) from object-class queries to abstract object attributes. It introduces OpenScan, a large-scale benchmark built on ScanNet200 that annotates 341 attributes across eight linguistic aspects (affordance, property, type, manner, synonym, requirement, element, and material) over 1,513 scenes, enabling comprehensive evaluation of attribute-grounded 3D understanding. Results on seven strong OV-3D baselines reveal substantial gaps in attribute comprehension, with performance drops most pronounced for abstract attributes such as affordance and property, and show that neither enlarging the vocabulary nor directly transferring OV-3D methods to GOV-3D is sufficient. The paper also analyzes the impact of query form and vocabulary size, and points to the potential of attribute-aware reasoning, including using LLMs for attribute-to-class mapping and incorporating attribute knowledge into visual-language models. Overall, OpenScan provides a platform for diagnosing and guiding improvements in generalized open-vocabulary 3D scene understanding, and suggests directions for integrating structured attribute knowledge into 3D perception systems.

Abstract

Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed set of object classes. However, existing approaches and benchmarks primarily focus on the open-vocabulary problem within the context of object classes, which is insufficient for a holistic evaluation of the extent to which a model understands the 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open-vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which consists of 3D object attributes across eight representative linguistic aspects, including affordance, property, and material. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed simply by scaling up object classes during training. We highlight the limitations of existing methodologies and explore promising directions to overcome the identified shortcomings.

Paper Structure (38 sections, 7 equations, 17 figures, 13 tables)


Introduction
Related Work
Task Setting and Benchmark
Task Formulation
Benchmark Description
Benchmark Annotation
Benchmark Statistics
Evaluation Metrics
Experiments
Main Results
The Impact of the Pre-trained Vocabulary Size
The Impact of the Query Form
Qualitative Results
Failure Cases Analysis
Conclusion
...and 23 more sections

Figures (17)

Figure 1: The proposed Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) task expands the vocabulary types of the classic Open-Vocabulary 3D Scene Understanding (OV-3D) task. While OV-3D only supports queries of object classes, GOV-3D supports queries of object-related abstract attributes.
Figure 2: OpenScan benchmark samples. The target objects are highlighted in blue.
Figure 3: Illustration of the data generation process for our OpenScan benchmark.
Figure 4: Impact of different pre-training vocabulary sizes.
Figure 5: Qualitative results of Open3DIS on our OpenScan benchmark. The GT objects and outputs are highlighted in color.
...and 12 more figures
