JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach

TL;DR

JOPP-3D is an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. It achieves a significant improvement over the SOTA in open- and closed-vocabulary 2D and 3D semantic segmentation.

Abstract

Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
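
The conversion of an RGB-D panorama into a 3D point cloud can be illustrated with a minimal sketch. It assumes an equirectangular panorama whose depth values are distances along each viewing ray, with the camera at the origin, +y up and +z forward. These conventions and the function names `panorama_pixel_to_point` and `panorama_to_point_cloud` are illustrative assumptions, not the paper's implementation.

```python
import math


def panorama_pixel_to_point(u, v, depth, width, height):
    """Back-project one equirectangular pixel with a range value to a 3D point.

    Assumed convention (not taken from the paper): u and v are continuous
    pixel coordinates, longitude grows to the right, latitude grows upward,
    +z is forward, +y is up, and depth is the distance along the ray.
    """
    lon = (u / width) * 2.0 * math.pi - math.pi    # in [-pi, pi]
    lat = math.pi / 2.0 - (v / height) * math.pi   # in [-pi/2, pi/2]
    x = depth * math.cos(lat) * math.sin(lon)
    y = depth * math.sin(lat)
    z = depth * math.cos(lat) * math.cos(lon)
    return (x, y, z)


def panorama_to_point_cloud(depth_map):
    """Convert a depth panorama (a list of rows) into a list of 3D points."""
    height = len(depth_map)
    width = len(depth_map[0])
    points = []
    for row in range(height):
        for col in range(width):
            d = depth_map[row][col]
            if d <= 0.0:  # skip pixels with missing or invalid depth
                continue
            # Adding 0.5 samples the ray through the pixel centre.
            points.append(
                panorama_pixel_to_point(col + 0.5, row + 0.5, d, width, height)
            )
    return points
```

A real pipeline would also apply the camera pose of each capture to bring the points into a common scene frame; that step is omitted here.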

Paper Structure (25 sections, 16 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Joint treatment of panoramas and 3D point clouds in JOPP-3D. Example of zero-shot performance on a construction site scene with varying mask scale.
  • Figure 2: Methodology. Sec. 3.1: We decompose all panoramic captures of a scene into their corresponding tangential perspective images and depths. This step also outputs the tangential poses and the 3D point cloud. Sec. 3.2: We then use these visual modalities to extract 3D instances [schult2023mask3d, sam3d] of the reconstructed scene and align these instances with CLIP [radford2021learning] embeddings. Sec. 3.3: In the final step, we use open-vocabulary querying to extract the required 3D scene semantics, and then apply a depth correspondence-based 3D-to-panoramic semantic extraction method to obtain the semantic segmentation for each panoramic input.
  • Figure 3: Representation of a spherical image using tangential decomposition. Left: Original spherical image. Middle: Tangential perspectives mapped onto the faces of an icosahedron. Right: Close-up view of selected tangential perspective images extracted from the spherical image.
  • Figure 4: 3D-to-Panoramic Semantic Extractor. (a) The process of panoramic imaging, seen from the top view of a room. (b) We emulate the same process on the extracted 3D semantic point cloud by placing the panoramic camera at the same poses. After each capture, we also transfer the semantic labels of nearby rooms (scenes) by establishing depth correspondences.
  • Figure 5: Qualitative analysis. Each row represents one panoramic capture from a different scene in the Stanford-2D-3D-s [2017arXiv170201105A] dataset. We present the semantic segmentation results from various versions of JOPP-3D. Notable differences are highlighted with dotted curves. In the first and fifth rows, we can observe why our depth correspondence technique yields segmentations through doors into nearby scenes. In the third row, we can see a clear problem caused by not masking the SAM [segmentAnything] crops: floor and ceiling cannot be segmented. Furthermore, querying for chair also selects floor. Note: the last column shows a large amount of the clutter color. This is because we assign to clutter all 3D points that could not be queried using the other 12 classes; as expected, the w/o SAM Mask ablation suffers the most due to poor CLIP [radford2021learning] embedding alignment.
  • ...and 2 more figures
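
The depth correspondence idea in the Figure 4 caption can be sketched as follows. Labeled 3D points are projected into an equirectangular image at the camera position, and a label is kept only when the point's range agrees with the panorama depth at that pixel. This is a simplified illustration under stated assumptions, not the authors' method: the camera has identity rotation, +y is up and +z is forward, and the names `point_to_panorama_pixel`, `transfer_labels`, the tolerance `tol` and the `unlabeled` value are all hypothetical.

```python
import math


def point_to_panorama_pixel(point, cam_pos, width, height):
    """Project a 3D point into an equirectangular image centred at cam_pos.

    Returns (col, row, range), or None if the point coincides with the camera.
    Identity camera rotation is assumed for simplicity.
    """
    dx = point[0] - cam_pos[0]
    dy = point[1] - cam_pos[1]
    dz = point[2] - cam_pos[2]
    rng = math.sqrt(dx * dx + dy * dy + dz * dz)
    if rng == 0.0:
        return None
    lon = math.atan2(dx, dz)                          # in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, dy / rng)))    # clamp against rounding
    u = (lon + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - lat) / math.pi * height
    col = min(int(u), width - 1)
    row = min(int(v), height - 1)
    return col, row, rng


def transfer_labels(points, labels, cam_pos, depth_map, tol=0.05, unlabeled=-1):
    """Assign point labels to panorama pixels whose depth matches the point range."""
    height = len(depth_map)
    width = len(depth_map[0])
    label_map = [[unlabeled] * width for _ in range(height)]
    # Smallest depth residual accepted so far per pixel, initialised to tol.
    best_err = [[tol] * width for _ in range(height)]
    for point, label in zip(points, labels):
        proj = point_to_panorama_pixel(point, cam_pos, width, height)
        if proj is None:
            continue
        col, row, rng = proj
        err = abs(rng - depth_map[row][col])
        if err <= best_err[row][col]:  # point lies on the observed surface
            label_map[row][col] = label
            best_err[row][col] = err
    return label_map
```

Because the check compares ranges rather than room membership, a point in a neighbouring room that is visible through a door passes the test, while a point hidden behind a wall does not.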