Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

Huy Ha; Shuran Song

TL;DR

The paper tackles open-world 3D scene understanding by separating semantic reasoning from geometric reasoning. It introduces Semantic Abstraction (SemAbs), which couples a semantic-aware wrapper that generates relevancy maps from a pretrained 2D Vision-Language Model with a semantic-agnostic 3D module that completes 3D occupancies from those abstractions. This approach enables open vocabulary generalization, robust zero-shot transfer, and sim2real applicability, demonstrated on two tasks: Open Vocabulary Semantic Scene Completion (OVSSC) and Visually Obscured Object Localization (VOOL). Key contributions include a multi-scale relevancy extractor for small objects, a data-generation and training pipeline in AI2-THOR, and strong empirical results across novel rooms, vocabulary, classes, materials, and real-world scans. SemAbs thus provides a practical pathway to leverage large pretrained VLMs for open-world robotic perception without sacrificing zero-shot robustness.
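The multi-scale relevancy extractor mentioned above can be illustrated with a minimal sketch. The summary does not spell out the procedure, so everything here is an assumption: the idea shown is simply to compute relevancy on sliding crops of several sizes and average the results per pixel, so that small objects fill a larger share of at least some crops. The names `multi_scale_relevancy`, `relevancy_fn`, `crop_sizes`, and `stride_fraction` are hypothetical, and `toy_relevancy` is only a stand-in for a real VLM relevancy call.

```python
from typing import Callable, List, Sequence

Grid = List[List[float]]


def multi_scale_relevancy(
    image: Grid,
    relevancy_fn: Callable[[Grid], Grid],
    crop_sizes: Sequence[int],
    stride_fraction: float = 0.5,
) -> Grid:
    """Average per-crop relevancy over square sliding windows of several sizes.

    `relevancy_fn` is a hypothetical callable that maps a crop to a relevancy
    grid of the same shape; in practice it would wrap a VLM such as CLIP.
    """
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]

    for requested in crop_sizes:
        size = min(requested, height, width)
        stride = max(1, int(size * stride_fraction))
        tops = list(range(0, height - size + 1, stride))
        lefts = list(range(0, width - size + 1, stride))
        # Make sure the last window reaches the image border.
        if tops[-1] != height - size:
            tops.append(height - size)
        if lefts[-1] != width - size:
            lefts.append(width - size)

        for top in tops:
            for left in lefts:
                crop = [row[left:left + size] for row in image[top:top + size]]
                rel = relevancy_fn(crop)
                # Accumulate the crop's relevancy back into image coordinates.
                for dy in range(size):
                    for dx in range(size):
                        total[top + dy][left + dx] += rel[dy][dx]
                        count[top + dy][left + dx] += 1

    return [
        [total[y][x] / count[y][x] if count[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]


def toy_relevancy(crop: Grid) -> Grid:
    """Stand-in for a VLM relevancy call: normalize the crop by its peak value."""
    peak = max(max(row) for row in crop)
    return [[value / peak if peak > 0 else 0.0 for value in row] for row in crop]
```

In this toy setting, a weak response that is overshadowed in the full image can still score highly inside a small crop where it is the strongest signal, which is the intuition the sketch is meant to convey.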

Abstract

We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs, a critical skill for robots to operate in the unstructured 3D world. Towards this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary, materials/lighting, classes, and domains (i.e., real-world scans) from training on limited 3D synthetic data. Code and data are available at https://semantic-abstraction.cs.columbia.edu/
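To make the idea of lifting a 2D relevancy map into 3D more concrete, here is a minimal sketch of one way a relevancy map could be projected into a voxel grid before a 3D module completes it. It assumes a per-pixel depth map and pinhole camera intrinsics (`fx`, `fy`, `cx`, `cy`), none of which are specified in this summary; the function name `project_relevancy_to_voxels` and the max-pooling of relevancy per voxel are likewise illustrative choices, not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

Grid = List[List[float]]
Voxel = Tuple[int, int, int]


def project_relevancy_to_voxels(
    relevancy: Grid,
    depth: Grid,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    voxel_size: float,
) -> Dict[Voxel, float]:
    """Back-project each pixel's relevancy into a sparse voxel grid.

    Assumes a pinhole camera model and a depth map aligned with the relevancy
    map (both assumptions made for this sketch).
    """
    voxels: Dict[Voxel, float] = {}
    for v, (rel_row, depth_row) in enumerate(zip(relevancy, depth)):
        for u, (rel, z) in enumerate(zip(rel_row, depth_row)):
            if z <= 0:
                continue  # skip pixels without a valid depth reading
            # Pinhole back-projection from pixel (u, v) to camera coordinates.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            key = (
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            )
            # Keep the strongest relevancy that lands in each voxel.
            voxels[key] = max(voxels.get(key, 0.0), rel)
    return voxels
```

The resulting sparse grid carries only relevancy values, not class labels, which is what would let a downstream 3D network remain agnostic to the semantic label that produced the map.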

Paper Structure (20 sections, 10 figures, 4 tables)

This paper contains 20 sections, 10 figures, 4 tables.

Introduction
Related Works
Method: Semantic Abstraction
Abstraction via Relevancy
A Multi-Scale Relevancy Extractor
Application to Open Vocabulary Semantic Scene Completion (OVSSC)
Application to Visually Obscured Object Localization (VOOL)
Experiments
Open-world Evaluation Results
Zero-shot Evaluation Results
Limitations and assumptions
Conclusion
Appendix
Relevancy as VLM's confidence
Network
...and 5 more sections

Figures (10)

Figure 1: Open-World 3D Scene Understanding. Our approach, Semantic Abstraction, extends 2D VLMs' capabilities to 3D scene understanding. Trained on a small amount of simulated data, our model generalizes to unseen classes in a novel domain (i.e., real-world scans), from small objects like "rubiks cube", to long-tail concepts like "harry potter", to hidden objects like the "used N95s in the garbage bin".
Figure 2: Open-world generalization requirements can build on top of each other, forming tiers of open-worldness (the distinct property of each tier is outlined in yellow).
Figure 3: Semantic Abstraction Overview. Our framework can be applied to open-world 3D scene understanding tasks (a-b) using the SemAbs module (c). It consists of a semantic-aware wrapper (green background) that abstracts the input image and semantic label into a relevancy map, and a semantic-abstracted 3D module (grey background) that completes the projected relevancy map into a 3D occupancy. This abstraction allows our approach to generalize to long-tail semantic labels unseen (bolded) during 3D training, such as the "CoRL ticket on top of the fireplace".
Figure 4: Our relevancy extractor robustly detects even small, long-tail objects, like the "nintendo switch".
Figure 5: Semantic Abstraction inherits CLIP's visual-semantic reasoning skills. From distinguishing colors (e.g. "red folder" vs. "blue folder") to recognizing cultural (e.g. "hogwarts box") and long-tail semantic concepts (e.g. "roomba", "hydrangea"), our approach offloads such visual-semantic reasoning challenges to CLIP. In doing so, its learned 3D spatial and geometric reasoning skills transfer sim2real in a zero-shot manner.
...and 5 more figures
