RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation

Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami

TL;DR

RAZER addresses open-vocabulary 3D scene understanding in real time by unifying GPU-accelerated volumetric reconstruction with frozen vision-language embeddings in a training-free framework. It introduces an online, instance-level fusion strategy that uses R-tree spatial indexing and incremental oriented bounding box (OBB) updates to maintain temporal consistency despite 2D segmentation noise. The approach also maintains a multi-hypothesis semantic embedding bank and voxel-level semantic maps to enable online 3D instance segmentation and zero-shot instance retrieval. Across SceneNN, ScanNet200, and Replica, RAZER achieves state-of-the-art results in 3D instance segmentation, 3D open-vocabulary segmentation, and 3D instance retrieval while running in real time, demonstrating practical applicability for embodied AI and robotics.

Abstract

Mapping and understanding complex 3D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build open-vocabulary semantic maps during online operation. Although recent vision-language models have enabled open-vocabulary object recognition in 2D images, they have not yet bridged the gap to 3D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this paper, we develop a zero-shot framework that integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates, while robustly handling 2D segmentation inconsistencies. The proposed general-purpose 3D scene understanding framework can be used for various tasks, including zero-shot 3D instance retrieval, segmentation, and object detection, to reason about previously unseen objects and interpret natural language queries. The project page is available at https://razer-3d.github.io.
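
To make the idea of zero-shot instance retrieval from a multi-hypothesis embedding bank more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each 3D instance stores several candidate embeddings and that a text query has already been encoded into the same space by a frozen vision-language model. The function names, the toy 3-dimensional vectors, and the rule of scoring an instance by its best-matching hypothesis are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def retrieve(query_embedding, embedding_bank, top_k=1):
    """Rank instances by the similarity of their best hypothesis to the query.

    embedding_bank maps an instance id to a list of hypothesis embeddings.
    Taking the maximum over hypotheses is an assumed scoring rule.
    """
    scored = []
    for instance_id, hypotheses in embedding_bank.items():
        best = max(cosine(query_embedding, h) for h in hypotheses)
        scored.append((best, instance_id))
    # Highest score first; ties broken by instance id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(instance_id, score) for score, instance_id in scored[:top_k]]


# Toy bank: real embeddings would come from a vision-language model.
bank = {
    "instance_0": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "instance_1": [[0.0, 1.0, 0.0]],
    "instance_2": [[0.0, 0.2, 1.0], [0.1, 0.0, 1.0]],
}
query = [0.0, 0.1, 1.0]  # stand-in for an encoded text query
ranking = retrieve(query, bank, top_k=2)
```

Keeping several hypotheses per instance means a single noisy 2D segmentation does not overwrite an otherwise good description of the object, which is one plausible reading of why such a bank helps with robustness.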

Paper Structure

This paper contains 28 sections, 17 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Pipeline overview of our proposed 3D scene understanding framework. Our system processes posed RGB-D inputs through open-vocabulary segmentation for robust 3D instance tracking. Spatio-temporal feature aggregation fuses and prunes tracks while updating a panoptic map that enables online text-based 3D instance retrieval and segmentation tasks.
  • Figure 2: System-level architecture of our RAZER framework. It processes RGB, depth, and pose inputs through three modules: (1) Instance Tracking to enable efficient feature updates, (2) Aggregation Manager to aggregate and fuse/prune instances and their corresponding coarse features, and (3) Map Update to update features at voxel level and their corresponding labels, thus generating a panoptic map that enables 3D scene understanding.
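
The instance tracking and fuse/prune steps named in the Figure 2 caption can be sketched roughly as below. This is a deliberately simplified illustration under stated assumptions: it uses axis-aligned boxes in place of oriented bounding boxes, a linear scan in place of an R-tree, and a hypothetical IoU threshold of 0.25. None of the names (`Box3D`, `associate`, `iou_threshold`) come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    """Axis-aligned 3D box; a simplification of the OBBs used by RAZER."""
    lo: tuple
    hi: tuple

    def volume(self):
        v = 1.0
        for a, b in zip(self.lo, self.hi):
            v *= max(0.0, b - a)  # empty along any axis gives zero volume
        return v


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    lo = tuple(max(x, y) for x, y in zip(a.lo, b.lo))
    hi = tuple(min(x, y) for x, y in zip(a.hi, b.hi))
    inter = Box3D(lo, hi).volume()
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0.0 else 0.0


def merge(a, b):
    """Smallest axis-aligned box enclosing both inputs (incremental update)."""
    lo = tuple(min(x, y) for x, y in zip(a.lo, b.lo))
    hi = tuple(max(x, y) for x, y in zip(a.hi, b.hi))
    return Box3D(lo, hi)


def associate(tracks, detection, iou_threshold=0.25):
    """Attach a detection to the best-overlapping track, or start a new one.

    tracks maps integer track ids to Box3D. A spatial index such as an
    R-tree would replace this linear scan in a real-time system.
    """
    best_id, best_iou = None, iou_threshold
    for track_id, box in tracks.items():
        score = iou(box, detection)
        if score >= best_iou:
            best_id, best_iou = track_id, score
    if best_id is None:
        best_id = max(tracks, default=-1) + 1
        tracks[best_id] = detection
    else:
        tracks[best_id] = merge(tracks[best_id], detection)
    return best_id


tracks = {0: Box3D((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))}
near = Box3D((0.1, 0.0, 0.0), (1.1, 1.0, 1.0))  # overlaps track 0
far = Box3D((5.0, 5.0, 5.0), (6.0, 6.0, 6.0))   # overlaps nothing
matched = associate(tracks, near)
created = associate(tracks, far)
```

The linear scan costs time proportional to the number of tracks per detection; a spatial index narrows the candidates to boxes near the detection, which is the role the R-tree plays in the system described here.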