
Multi-scale Contrastive Adaptor Learning for Segmenting Anything in Underperformed Scenes

Ke Zhou, Zhongwei Qiu, Dongmei Fu

TL;DR

This work addresses the adaptation of large vision models, specifically SAM, to downstream tasks with limited data. It introduces MCA-SAM, a multi-scale contrastive adaptor framework that combines token-level and sample-level contrastive learning to improve the discriminability of local patch tokens and of global per-sample features, while the SAM image encoder stays frozen. On camouflaged object detection, shadow segmentation, and polyp segmentation, it reports state-of-the-art results, with relative improvements of 20.0% in MAE on COD10K, 6.0% in MAE on CAMO, 15.4% in BER on ISTD, and 7.9% in mDice on Kvasir-SEG. The approach offers a practical way to deploy foundation models in data-scarce domains while training only the adaptors and the mask decoder.

Abstract

Foundational vision models, such as the Segment Anything Model (SAM), have achieved significant breakthroughs through extensive pre-training on large-scale visual datasets. Despite their general success, these models may fall short in specialized tasks with limited data, and fine-tuning such large-scale models is often not feasible. Current strategies involve incorporating adaptors into the pre-trained SAM to facilitate downstream task performance with minimal model adjustment. However, these strategies can be hampered by suboptimal learning approaches for the adaptors. In this paper, we introduce a novel Multi-scale Contrastive Adaptor learning method named MCA-SAM, which enhances adaptor performance through a meticulously designed contrastive learning framework at both token and sample levels. Our Token-level Contrastive adaptor (TC-adaptor) focuses on refining local representations by improving the discriminability of patch tokens, while the Sample-level Contrastive adaptor (SC-adaptor) amplifies global understanding across different samples. Together, these adaptors synergistically enhance feature comparison within and across samples, bolstering the model's representational strength and its ability to adapt to new tasks. Empirical results demonstrate that MCA-SAM sets new benchmarks, outperforming existing methods in three challenging domains: camouflage object detection, shadow segmentation, and polyp segmentation. Specifically, MCA-SAM exhibits substantial relative performance enhancements, achieving a 20.0% improvement in MAE on the COD10K dataset, a 6.0% improvement in MAE on the CAMO dataset, a 15.4% improvement in BER on the ISTD dataset, and a 7.9% improvement in mDice on the Kvasir-SEG dataset.
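To make the two contrastive levels concrete, the following is a minimal sketch of a generic InfoNCE-style loss, which is one common way to implement contrastive learning. It is not necessarily the loss the authors use: the function names (`cosine`, `info_nce`, `mean_pool`), the temperature value, the use of mean pooling for the sample-level feature, and the way positives and negatives are chosen are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: negative log-softmax of the positive pair.

    Assumption: similarity is cosine similarity scaled by a temperature.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

def mean_pool(tokens):
    """Average the patch tokens of one image into a single global vector
    (an assumed choice of sample-level feature)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

# Token level (hypothetical setup): anchor and positive are patch tokens that
# should match; negatives are other patch tokens from the same image.
token_loss = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.2]])

# Sample level (hypothetical setup): pooled features of one sample versus
# pooled features of other samples in the batch.
sample_a = mean_pool([[1.0, 0.0], [0.8, 0.2]])
sample_b = mean_pool([[0.0, 1.0], [0.1, 0.9]])
sample_loss = info_nce(sample_a, sample_a, [sample_b])
```

The same loss function serves both levels in this sketch; only the choice of what counts as an anchor, a positive, and a negative changes between the token-level and sample-level cases.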

Paper Structure

This paper contains 34 sections, 16 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of (a) the current adaptor learning framework, as in SAM-adaptor chen2023sam, and (b) our multi-scale contrastive adaptor learning framework (MCA-SAM), which consists of an encoder, a decoder, and Multi-scale Contrastive adaptors (MC-adaptors).
  • Figure 2: The framework of Multi-scale Contrastive Adaptor learning for SAM (MCA-SAM). During training, the MC-adaptors are inserted into each Transformer layer of SAM. The parameters of SAM's image encoder are frozen; only the parameters of the adaptors and the mask decoder are tunable. MC-adaptors include contrastive adaptors at both the token level and the sample level, supervised by a token-level contrastive loss and a sample-level contrastive loss. During inference, the input to each Transformer layer is the sum of the previous layer's output and the current adaptor's output. $\otimes$ represents element-wise sum.
  • Figure 3: The architecture of the token-level contrastive adaptor (TC-adaptor) and the sample-level contrastive adaptor (SC-adaptor). (a) The TC-adaptor enhances SAM's discriminability among local spatial tokens. (b) The SC-adaptor enhances SAM's discriminability among the samples in a batch.
  • Figure 4: Visual comparison with SAM-adaptor chen2023sam on extremely challenging cases from the COD10K fan2020camouflaged, CAMO le2019anabranch, ISTD wang2018stacked, and Kvasir jha2020kvasir datasets, respectively. Owing to the stronger capacity of the multi-scale contrastive adaptors, MCA-SAM can localize pixels that are difficult to distinguish.
  • Figure 5: Visual comparison between SAM-adaptor, TC-adaptor, SC-adaptor, and MC-adaptor on a case from the Polyp jha2020kvasir dataset. All models use a ViT-B backbone.
  • ...and 3 more figures
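
The data flow described in the Figure 2 caption (frozen encoder layers, with each layer's input formed as the sum of the previous layer's output and the current adaptor's output) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: layers and adaptors are plain Python functions on vectors, the adaptor is assumed to take the previous layer's output as its input, and the parameter-name prefixes `adaptor.` and `mask_decoder.` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]

def add(u: Vector, v: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def encode(x: Vector, frozen_layers: Sequence[Layer],
           adaptors: Sequence[Layer]) -> Vector:
    """Run the encoder with one adaptor per Transformer layer.

    Each layer receives (previous output + adaptor output).
    Assumption: the adaptor reads the previous layer's output.
    """
    h = x
    for layer, adaptor in zip(frozen_layers, adaptors):
        layer_input = add(h, adaptor(h))
        h = layer(layer_input)
    return h

def trainable_names(param_names: Sequence[str],
                    prefixes=("adaptor.", "mask_decoder.")) -> List[str]:
    """Select parameters to train; everything else stays frozen.

    The prefixes are hypothetical names chosen for this sketch.
    """
    return [n for n in param_names if n.startswith(prefixes)]

# Toy usage: two "layers" that double their input, adaptors that add a bias of 1.
double = lambda v: [2.0 * a for a in v]
bias_one = lambda v: [1.0 for _ in v]
output = encode([1.0], [double, double], [bias_one, bias_one])
```

In a real training setup the same selection idea would decide which parameters receive gradient updates, so that only the adaptors and the mask decoder change while the image encoder keeps its pre-trained weights.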