Segment Any Object Model (SAOM): Real-to-Simulation Fine-Tuning Strategy for Multi-Class Multi-Instance Segmentation

Mariia Khan, Yue Qiu, Yuren Cong, Jumana Abu-Khalaf, David Suter, Bodo Rosenhahn

TL;DR

SAOM significantly improves on SAM, with a 28% increase in mIoU and a 25% increase in mAcc for 54 frequently-seen indoor object classes. The Real-to-Simulation fine-tuning strategy also shows promising generalization to real environments without any training on real-world data (sim-to-real).

Abstract

Multi-class multi-instance segmentation is the task of identifying masks for multiple object classes and multiple instances of the same class within an image. The foundational Segment Anything Model (SAM) is designed for promptable multi-class multi-instance segmentation but tends to output part or sub-part masks in the "everything" mode for various real-world applications. Whole-object segmentation masks play a crucial role in indoor scene understanding, especially in robotics applications. We propose a new domain-invariant Real-to-Simulation (Real-Sim) fine-tuning strategy for SAM. We use object images and ground-truth data collected from the Ai2Thor simulator during fine-tuning (real-to-sim). To allow our Segment Any Object Model (SAOM) to work in the "everything" mode, we propose a novel nearest neighbour assignment method that updates the point embeddings for each ground-truth mask. SAOM is evaluated on our own dataset collected from the Ai2Thor simulator. SAOM significantly improves on SAM, with a 28% increase in mIoU and a 25% increase in mAcc for 54 frequently-seen indoor object classes. Moreover, our Real-to-Simulation fine-tuning strategy demonstrates promising generalization performance in real environments without being trained on real-world data (sim-to-real). The dataset and the code will be released after publication.
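The abstract reports its gains in mIoU and mAcc but does not define them here. As a rough illustration only, the sketch below computes both from a per-pixel class confusion matrix using the common semantic-segmentation definitions (per-class intersection-over-union and per-class pixel accuracy, each averaged over classes). Whether the paper uses exactly these definitions is an assumption, and the function names `confusion_matrix` and `miou_macc` are hypothetical.

```python
import numpy as np


def confusion_matrix(gt: np.ndarray, pred: np.ndarray, num_classes: int) -> np.ndarray:
    """Count pixels per (ground-truth class, predicted class) pair.

    gt and pred are integer label maps of the same shape with values in
    [0, num_classes). Rows index ground truth, columns index prediction.
    """
    flat = gt.ravel().astype(np.int64) * num_classes + pred.ravel().astype(np.int64)
    counts = np.bincount(flat, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def miou_macc(cm: np.ndarray) -> tuple[float, float]:
    """Return (mIoU, mAcc) under the assumed standard definitions.

    IoU_c = TP_c / (GT_c + Pred_c - TP_c), Acc_c = TP_c / GT_c.
    Classes with an undefined ratio (0 / 0) are skipped in the mean.
    """
    tp = np.diag(cm).astype(float)
    gt_total = cm.sum(axis=1).astype(float)
    pred_total = cm.sum(axis=0).astype(float)
    union = gt_total + pred_total - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union
        acc = tp / gt_total
    return float(np.nanmean(iou)), float(np.nanmean(acc))


# Toy 2x2 label maps with two classes.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
miou, macc = miou_macc(confusion_matrix(gt, pred, num_classes=2))
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12; per-class accuracy is 1/2 and 1, giving a mean of 0.75.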

Paper Structure

This paper contains 16 sections, 1 equation, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between vanilla SAM and our SAOM on images from the Ai2Thor simulator (first row) and real-life scenes (second and third rows), where we apply the "everything" mode to obtain the displayed segmentation. We opted for thicker border lines in our SAOM model to emphasize the whole-object nature of the segmentation masks.
  • Figure 2: Our domain-invariant Real-to-Simulation (Real-Sim) fine-tuning strategy for SAM. We use object images and GT data collected from the Ai2Thor simulator during the fine-tuning (real-to-sim) stage. The fine-tuned model, SAOM, can be directly tested on real images (sim-to-real) without being previously trained on real-world data.
  • Figure 3: Our domain-invariant fine-tuning strategy for SAM, SAOM, designed specifically for multi-class multi-instance semantic segmentation in the "everything" mode. We used our novel nearest neighbour assignment method and substituted the original object point prompts with their nearest neighbours from a pre-defined point grid on an image to make the model functional in the "everything" mode.
  • Figure 4: Comparison between vanilla SAM, Semantic-SAM (SemSAM) and our SAOM on images from the real-to-sim test set, where we adopt the "everything" mode to obtain segmentation for 4 different scene types. Note that objects such as the bed, laptop and blinds in a bedroom, the microwave and fridge in a kitchen, the TV and armchair in a living room, or the toilet and plunger in a bathroom have a whole-object segmentation mask with SAOM.
  • Figure 5: Comparison between the vanilla SAM model, SAOM with simple foreground object point selection, and SAOM with the nearest neighbour assignment method. We adopt the "everything" mode to obtain segmentation masks. Note that the sofa is segmented as a whole object only with SAOM.
  • ...and 1 more figures
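
The Figure 3 caption describes replacing each object point prompt with its nearest neighbour on a pre-defined point grid. A minimal sketch of that substitution step is shown below, under several assumptions not stated in this section: the grid is a uniform lattice of cell centres in normalised [0, 1] coordinates, distance is Euclidean, and the helper names `build_point_grid` and `assign_nearest_grid_points` are hypothetical. It covers only the coordinate substitution, not how the paper updates point embeddings during fine-tuning.

```python
import numpy as np


def build_point_grid(n_per_side: int) -> np.ndarray:
    """Return an (n_per_side**2, 2) array of (x, y) cell centres in [0, 1]."""
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)  # row-major: x varies fastest
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)


def assign_nearest_grid_points(
    object_points: np.ndarray, grid: np.ndarray
) -> tuple[np.ndarray, np.ndarray]:
    """Map each (x, y) object point to its closest grid point.

    Returns the grid indices and the substituted grid coordinates.
    """
    # Pairwise Euclidean distances, shape (num_points, num_grid_points).
    diffs = object_points[:, None, :] - grid[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    idx = dists.argmin(axis=1)
    return idx, grid[idx]


# Two hypothetical object prompts (one per ground-truth mask), normalised coords.
grid = build_point_grid(4)  # 16 grid points at 0.125, 0.375, 0.625, 0.875
prompts = np.array([[0.40, 0.10], [0.90, 0.90]])
indices, snapped = assign_nearest_grid_points(prompts, grid)
```

Snapping prompts onto the grid means the points seen during fine-tuning come from the same set of locations that a grid-based "everything" mode would query at inference, which is one plausible reading of why the caption says the substitution makes the model functional in that mode.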