Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning

Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Kunze Huang, Xinghao Ding, Yue Huang

TL;DR

This work addresses the generalization gap of segmentation foundation models under distribution and prompt shifts by introducing SlotSAM, a method that learns unsupervised object-centric representations from a foundation model's encoder and injects them as object tokens into the decoder. By redefining the reconstruction target as high-level encoder features and employing a bootstrapped, low-parameter fine-tuning regime, SlotSAM achieves robust object perception with minimal additional training. The approach demonstrates strong empirical gains across natural, medical, camouflaged, and robotic datasets, including surpassing 90% mIoU on medical segmentation and notable improvements on challenging camouflaged scenes, while maintaining efficiency. This has practical impact for deploying segmentation foundation models in diverse, real-world environments where labeled data and annotation quality vary.

Abstract

Foundation models have made incredible strides in achieving zero-shot or few-shot generalization, leveraging prompt engineering to mimic the problem-solving approach of human intelligence. However, when it comes to some foundation models like Segment Anything, there is still a challenge in performing well on out-of-distribution data, including camouflaged and medical images. Inconsistent prompting strategies during fine-tuning and testing further compound the issue, leading to decreased performance. Drawing inspiration from how human cognition processes new environments, we introduce SlotSAM, a method that reconstructs features from the encoder in a self-supervised manner to create object-centric representations. These representations are then integrated into the foundation model, bolstering its object-level perceptual capabilities while reducing the impact of distribution-related variables. The beauty of SlotSAM lies in its simplicity and adaptability to various tasks, making it a versatile solution that significantly enhances the generalization abilities of foundation models. Through limited parameter fine-tuning in a bootstrap manner, our approach paves the way for improved generalization in novel environments. The code is available at github.com/lytang63/SlotSAM.
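
To make the idea of reconstructing encoder features into object-centric slots more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a heavily simplified, parameter-free variant of slot-style grouping: slots compete for each feature vector through a softmax over slots, each slot is updated to the attention-weighted mean of the features, and the features are then reconstructed as an attention-weighted mixture of slots. Learned projections, recurrent slot updates, a learned decoder, and the injection of slots into the mask decoder as an object token are all omitted. The function names, the toy feature values, and the fixed slot initialisation are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(features, slots):
    """attn[i][k] is the share of feature i claimed by slot k.

    The softmax runs over slots, so slots compete for each feature.
    """
    scale = 1.0 / math.sqrt(len(features[0]))
    return [softmax([scale * dot(f, s) for s in slots]) for f in features]


def group_into_slots(features, init_slots, iters=5):
    """Simplified slot-style grouping (assumption: no learned parameters)."""
    dim = len(features[0])
    slots = [list(s) for s in init_slots]
    for _ in range(iters):
        attn = attend(features, slots)
        for k in range(len(slots)):
            weights = [a[k] for a in attn]
            total = sum(weights) + 1e-8
            # Each slot moves to the weighted mean of the features it claims.
            slots[k] = [
                sum(w * f[d] for w, f in zip(weights, features)) / total
                for d in range(dim)
            ]
    return slots, attend(features, slots)


def reconstruction_loss(features, slots, attn):
    """Mean squared error between features and their slot-based reconstruction."""
    total, count = 0.0, 0
    for f, a in zip(features, attn):
        for d in range(len(f)):
            recon = sum(a[k] * slots[k][d] for k in range(len(slots)))
            total += (recon - f[d]) ** 2
            count += 1
    return total / count


# Toy stand-in for encoder features: four 2-D vectors forming two "objects".
features = [[4.0, 0.2], [3.8, 0.0], [0.0, 4.0], [0.2, 3.9]]
init_slots = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical fixed initialisation

slots, attn = group_into_slots(features, init_slots)
loss = reconstruction_loss(features, slots, attn)
print(slots, loss)
```

In this toy setting the reconstruction target is the feature vectors themselves rather than raw pixels, which mirrors the choice described above of reconstructing high-level encoder features; everything else about the sketch is a simplification.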

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Performance comparison between SAM, WDASS, WESAM and SlotSAM across different downstream tasks under distribution shift and prompt shift.
  • Figure 2: Overview of SlotSAM. Stage 1 obtains slots by reconstructing higher-order semantics. Stage 2 injects the slots into the foundation model by nonlinearly combining them into an object token, followed by self-training. The whole process is task-independent.
  • Figure 3: Comparison of the quality of the slots.
  • Figure 4: Comparison of the fineness of the predicted masks.