CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model

Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Ruijie Ren, Xiaoqin Zhang, Ling Shao, Shijian Lu

TL;DR

The paper tackles the challenge of adapting SAM to unconventional domains with limited labeled data. It introduces CAT-SAM, a decoder-conditioned joint tuning framework that freezes SAM and uses a Prompt Bridge to couple the large image encoder with the lightweight mask decoder, enabling data-efficient adaptation. Two practical variants, CAT-SAM-T (prompt-tuning) and CAT-SAM-A (adapters), implement the bridge within different tuning schemes. Across 11 diverse tasks and 8 datasets, CAT-SAM delivers consistent improvements in one-shot and few-shot settings, including non-RGB domains, demonstrating strong cross-domain transfer and end-to-end segmentation performance without full fine-tuning of SAM.

Abstract

The recent Segment Anything Model (SAM) has demonstrated remarkable zero-shot capability and flexible geometric prompting in general image segmentation. However, SAM often struggles when handling various unconventional images, such as aerial, medical, and non-RGB images. This paper presents CAT-SAM, a ConditionAl Tuning network that adapts SAM toward various unconventional target tasks with just a few target samples. CAT-SAM freezes the entire SAM and adapts its mask decoder and image encoder simultaneously with a small number of learnable parameters. The core design is a prompt bridge structure that enables decoder-conditioned joint tuning of the heavyweight image encoder and the lightweight mask decoder. The bridge maps the prompt token of the mask decoder to the image encoder, fostering synergistic adaptation of the encoder and the decoder with mutual benefits. We develop two representative tuning strategies for the image encoder, which lead to two CAT-SAM variants: one injecting learnable prompt tokens in the input space and the other inserting lightweight adapter networks. Extensive experiments over 11 unconventional tasks show that both CAT-SAM variants consistently achieve superior target segmentation performance, even under the very challenging one-shot adaptation setup. Project page: https://xiaoaoran.github.io/projects/CAT-SAM
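The decoder-conditioned idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `PromptBridge`, the helper functions, the dimensions, and the way the adapter is conditioned are all hypothetical choices made for illustration, written in pure Python so the data flow is easy to follow. It shows a learnable decoder token being projected to the encoder width, then used either as a prepended prompt token (in the spirit of CAT-SAM-T) or as a conditioning signal for a bottleneck adapter (in the spirit of CAT-SAM-A), while the patch tokens from the frozen backbone are left untouched.

```python
import random


def linear(weight, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def init_matrix(rows, cols, rng, scale=0.02):
    """Small random matrix standing in for a trainable weight."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class PromptBridge:
    """Hypothetical bridge: projects the decoder's prompt token to encoder width."""

    def __init__(self, dec_dim, enc_dim, rng):
        self.weight = init_matrix(enc_dim, dec_dim, rng)  # trainable

    def __call__(self, decoder_token):
        return linear(self.weight, decoder_token)


def prompt_variant(patch_tokens, bridged_token):
    """Prompt-tuning style (assumed): prepend the bridged token to the sequence."""
    return [bridged_token] + patch_tokens


def adapter_variant(patch_tokens, bridged_token, down, up):
    """Adapter style (assumed): residual bottleneck conditioned on the bridged token."""
    adapted = []
    for tok in patch_tokens:
        conditioned = [t + b for t, b in zip(tok, bridged_token)]
        hidden = [max(0.0, h) for h in linear(down, conditioned)]  # ReLU
        delta = linear(up, hidden)
        adapted.append([t + d for t, d in zip(tok, delta)])
    return adapted


rng = random.Random(0)
DEC_DIM, ENC_DIM, BOTTLENECK, NUM_PATCHES = 4, 8, 2, 3  # toy sizes, not SAM's

# Trainable token that lives in the mask decoder.
decoder_token = [rng.gauss(0.0, 1.0) for _ in range(DEC_DIM)]
# Stand-in for patch embeddings produced by the frozen image encoder.
patch_tokens = [[rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)]
                for _ in range(NUM_PATCHES)]

bridge = PromptBridge(DEC_DIM, ENC_DIM, rng)
bridged = bridge(decoder_token)

seq_t = prompt_variant(patch_tokens, bridged)
down = init_matrix(BOTTLENECK, ENC_DIM, rng)
up = init_matrix(ENC_DIM, BOTTLENECK, rng)
seq_a = adapter_variant(patch_tokens, bridged, down, up)

print(len(seq_t), len(seq_a))
```

In this sketch the only trainable pieces are the decoder token, the bridge projection, and the adapter matrices. Because the encoder-side tuning signal is derived from the decoder token, an update to that token affects both components at once, which is the coupling the abstract refers to.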

Paper Structure (25 sections, 8 figures, 12 tables)

Figures (8)

  • Figure 1: The proposed CAT-SAM performs ConditionAl joint Tuning (CAT) to establish communication between SAM's heavyweight image encoder and lightweight mask decoder. This enables synergistic adaptation of the two network components, mitigating tuning imbalances and improving few-shot SAM adaptation.
  • Figure 2: Overview of CAT-SAM. CAT-SAM keeps the whole SAM frozen while simultaneously tuning the image encoder and mask decoder for downstream adaptation. To address the tuning imbalance between these two network components, we introduce decoder-conditioned joint tuning through a Prompt Bridge structure, enabling synergistic and enhanced adaptation. We present two CAT-SAM variants: CAT-SAM-T in (a) and CAT-SAM-A in (b), obtained by integrating the prompt bridge with prompt-based and adapter-based tuning strategies for the image encoder, respectively. (c) illustrates the two tailored prompt bridge structures, PB-T and PB-A.
  • Figure 3: Single point to mask evaluation for one-shot adaptation.
  • Figure 4: Visual comparisons of SAM [kirillov2023segment] (top row) and CAT-SAM (bottom row). We illustrate samples from WHU [ji2018fully] for building segmentation, MA. Roads [MnihThesis] for road segmentation, SBU-Shadow [vicente2016large] for shadow segmentation, Kvasir [pogorelov2017kvasir] for polyp segmentation, JSRT [shiraishi2000development] for chest organ segmentation (X-ray images), FLS [singh2021marine] for marine debris segmentation (Sonar images), and HRSID [wei2020hrsid] for ship instance segmentation (SAR images). CAT-SAM results are from one-shot adaptation on all datasets except FLS, where 16-shot adaptation is used. Red boxes and stars denote geometric prompts, colored regions are mask predictions, and lines show the boundaries of the ground-truth segmentation.
  • Figure 5: Details of the mask decoder in CAT-SAM. Both CAT-SAM variants share the same decoder tuning structures.
  • ...and 3 more figures