NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation

Junyuan Fang, Zihan Wang, Yejun Zhang, Shuzhe Wang, Iaroslav Melekhov, Juho Kannala

TL;DR

A novel 3D Gaussian Splatting based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints.

Abstract

Vision-language models (VLMs) have demonstrated impressive zero-shot transfer capabilities in image-level visual perception tasks. However, they fall short in 3D instance-level segmentation tasks that require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training-free strategy seamlessly integrates with prior hard visual prompts, enriching object-descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.

Paper Structure

This paper contains 15 sections, 9 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Upper: Previous CLIP-based back-projection techniques [takmaz2023openmask3d, lu2023ovir, nguyen2023open3dis] struggle with object misclassification in complex 3D scenes. Limited views here cause the chair to be incorrectly identified as a "cushion" when both objects appear together. Lower: NVSMask3D utilizes novel view synthesis with hard visual prompts to generate interpolated views around the target object, creating a more continuous and detailed 3D representation.
  • Figure 2: Overview of the NVSMask3D Pipeline. NVSMask3D begins by applying a class-agnostic 3D proposal to segment the input point cloud, generating initial instance masks. Next, the 3D scene is represented using 3D-GS. Top-$k$ camera poses are selected as references to interpolate additional views. Visual prompts are applied to these interpolated views to emphasize object-descriptive features. CLIP features are then extracted from each image and combined using a WFB mechanism, which stabilizes feature contributions from NVS-generated and top-$k$ views, ultimately enhancing 3D instance segmentation accuracy.
  • Figure 3: Reference pose, camera pose adjustment and camera pose interpolation. Initial camera poses before adjustment, where the camera pose is misaligned with the object center (left). Adjusted camera pose, realigned towards the object's geometric center (middle). Interpolation and final readjustment of the interpolated camera pose to generate novel views from 3D-GS (right).
  • Figure 4: Qualitative renderings of raw images and four different hard visual prompts. From left to right: the raw input image, the interpolated object-centered camera pose, and the subsequent visual prompts: blurring, cropping, and segmented Gaussians.
  • Figure 5: Semantic Segmentation and Retrieval in Indoor Scenes. This figure showcases NVSMask3D's open-vocabulary instance segmentation results across various query categories: (a) Affordance ("sit", "hang"), (d) States ("a bin with a white bag", "a written whiteboard"), (g) Colors ("a white desk", "a red chair"), (j) Activities ("photographing", "drinking"), and (m) Objects ("rectangular table", "shelf with items"). Each row includes the original point cloud (left) and segmentation results with highlighted queries (middle, right).