FM-Fusion: Instance-aware Semantic Mapping Boosted by Vision-Language Foundation Models

Chuhao Liu; Ke Wang; Jieqi Shi; Zhijian Qiao; Shaojie Shen

TL;DR

This work proposes a probabilistic label fusion method that predicts close-set semantic classes from open-set label measurements, and incrementally reconstructs an instance-aware semantic map from object detections generated by foundation models.

Abstract

Semantic mapping based on supervised object detectors is sensitive to the image distribution. In real-world environments, object detection and segmentation performance can drop substantially, which prevents the use of semantic mapping in wider domains. In contrast, vision-language foundation models demonstrate strong zero-shot transferability across data distributions, which provides an opportunity to construct generalizable instance-aware semantic maps. Hence, this work explores how to boost instance-aware semantic mapping using object detections generated by foundation models. We propose a probabilistic label fusion method to predict close-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our system incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task, significantly outperforming traditional semantic mapping methods.
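To make the label fusion idea concrete, the following is a minimal sketch of a recursive Bayesian update over close-set classes given a stream of open-set label measurements. It is a simplification under stated assumptions, not the paper's implementation: the class names, the open-set labels, and every number in `LIKELIHOOD` are hypothetical, the measurements are treated as conditionally independent, and detection confidence scores and text prompts are ignored.

```python
import numpy as np

# Hypothetical close-set classes c_n and open-set labels o_m (illustration only).
CLASSES = ["bookshelf", "cabinet", "table"]
OPEN_LABELS = ["bookshelf", "shelf", "closet", "desk"]

# Hypothetical likelihood matrix p(o_m | c_n): rows are measured open-set
# labels, columns are true close-set classes. Each column sums to 1.
LIKELIHOOD = np.array([
    [0.60, 0.10, 0.02],
    [0.30, 0.20, 0.03],
    [0.05, 0.60, 0.05],
    [0.05, 0.10, 0.90],
])


def fuse_labels(measurements, likelihood=LIKELIHOOD, prior=None):
    """Return the posterior over close-set classes after all measurements.

    Assumes measurements are conditionally independent given the true class.
    """
    n_classes = likelihood.shape[1]
    if prior is None:
        prior = np.full(n_classes, 1.0 / n_classes)  # uniform prior (assumption)
    log_post = np.log(np.asarray(prior, dtype=float))
    for label in measurements:
        row = OPEN_LABELS.index(label)
        # Accumulate log-likelihoods to avoid underflow over long sequences.
        log_post += np.log(likelihood[row] + 1e-12)
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()


posterior = fuse_labels(["shelf", "bookshelf", "shelf"])
predicted = CLASSES[int(np.argmax(posterior))]
```

Working in log space keeps the update stable when an instance is observed in many frames; the predicted class is simply the maximum of the posterior.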

Paper Structure (17 sections, 11 equations, 14 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Vision-Language Foundation Models
Semantic Mapping
Fuse Multi-frame Detections
Overview
Prepare the object detector
Data association and integration
Probabilistic label fusion
Instance refinement
Merge over-segmentation
Instance-geometry fusion
Experiment
ScanNet Evaluation
SceneNN evaluation
...and 2 more sections

Figures (14)

Figure 1: Our system reads a sequence of RGB-D frames. The vision-language foundation models detect objects with open-set labels and high-quality masks. The SLAM modules generate camera poses and a global volumetric map. Our method incrementally fuses the object detections from foundation models into an instance-aware semantic map. A reconstructed semantic map from ScanNet scene0011_01 is shown.
Figure 2: System overview of FM-Fusion
Figure 3: GroundingDINO detects a bookshelf and generates multiple open-set label measurements across frames. Our label fusion module predicts its semantic class in the NYUv2 label set $\mathcal{L}_c$ from label measurements in $\mathcal{L}_o$.
Figure 4: The label likelihood matrix $p(y_i=o_m,\exists o_m\in q^t|L_s=c_n)$ summarized from ScanNet is shown on the left. Each column represents a specific true semantic class $c_n$, while each row represents a measured open-set label $o_m$. A manually assigned likelihood matrix is shown on the right.
Figure 5: An example of an inconsistent instance mask generated from SAM. In each of the three frames, different areas of the bed are segmented.
...and 9 more figures
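The inconsistent masks described in Figure 5 produce several partial instances for one object, which the instance refinement module merges. A minimal sketch of one way to do this is given here. It assumes each instance is stored as a set of voxel indices and merges two instances when their overlap, relative to the smaller one, exceeds a threshold. The function name `merge_over_segmented`, the overlap criterion, and the `min_overlap` value are illustrative assumptions; the criterion actually used by the paper is not stated in this summary.

```python
def merge_over_segmented(instances, min_overlap=0.5):
    """Greedily merge instances whose voxel sets overlap heavily.

    instances: dict mapping instance id -> set of (x, y, z) voxel indices.
    min_overlap: hypothetical threshold on |A & B| / min(|A|, |B|).
    Returns a new dict; the input is left unmodified.
    """
    merged = {k: set(v) for k, v in instances.items()}
    changed = True
    while changed:
        changed = False
        ids = sorted(merged)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                smaller = min(len(merged[a]), len(merged[b]))
                shared = len(merged[a] & merged[b])
                if smaller and shared / smaller >= min_overlap:
                    # Absorb b into a, then restart the scan on the new set.
                    merged[a] |= merged.pop(b)
                    changed = True
                    break
            if changed:
                break
    return merged


# Two partial segments of the same bed plus one unrelated object.
example = {
    0: {(0, 0, 0), (1, 0, 0), (2, 0, 0)},
    1: {(1, 0, 0), (2, 0, 0)},
    2: {(9, 9, 9)},
}
result = merge_over_segmented(example)
```

Normalizing by the smaller instance lets a small fragment be absorbed by a large one even when their intersection-over-union is low, which is the typical over-segmentation case.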
