Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model

Zihan Zhong, Zhiqiang Tang, Tong He, Haoyang Fang, Chun Yuan

TL;DR

This paper addresses the underperformance of the Segment Anything Model (SAM) in domain-specific segmentation by introducing Conv-LoRA, a parameter-efficient fine-tuning (PEFT) method that combines LoRA with lightweight convolutions and a Mixture-of-Experts (MoE) to inject multi-scale local priors into SAM's ViT encoder. By supporting end-to-end multi-class segmentation and freezing the prompt encoder, Conv-LoRA enables SAM to capture high-level semantics beyond its binary mask pretraining. Experiments across medical, natural, agricultural, and remote-sensing domains show that Conv-LoRA consistently outperforms existing PEFT methods with minimal parameter overhead and favorable training efficiency. The results indicate that Conv-LoRA not only preserves SAM’s segmentation knowledge but also improves its ability to learn nuanced semantic distinctions, suggesting broad applicability to real-world domain adaptation. The work also provides insights into SAM’s local priors, the role of multi-scale priors, and how adaptive scale selection via MoE benefits downstream segmentation tasks.

Abstract

The Segment Anything Model (SAM) stands as a foundational framework for image segmentation. While it exhibits remarkable zero-shot generalization in typical scenarios, its advantage diminishes when applied to specialized domains like medical imagery and remote sensing. To address this limitation, this paper introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning approach. By integrating ultra-lightweight convolutional parameters into Low-Rank Adaptation (LoRA), Conv-LoRA can inject image-related inductive biases into the plain ViT encoder, further reinforcing SAM's local prior assumption. Notably, Conv-LoRA not only preserves SAM's extensive segmentation knowledge but also revives its capacity for learning high-level image semantics, which is constrained by SAM's foreground-background segmentation pretraining. Comprehensive experimentation across diverse benchmarks spanning multiple domains underscores Conv-LoRA's superiority in adapting SAM to real-world semantic segmentation tasks.
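
To make the idea of placing convolutions inside the LoRA bottleneck concrete, here is a minimal pure-Python sketch of one low-rank update. It follows the captions of Figures 2 and 3 only loosely, and several details are assumptions rather than the authors' implementation: the residual connection around the expert, the gate that pools the bottleneck features and picks a single top-scoring expert without rescaling its output, nearest-neighbour resizing, and all names (`conv_lora_update`, `w_e`, `w_d`, `gate_w`) are illustrative.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def resize(fmap, size):
    """Nearest-neighbour resize of a square [C][H][H] map to [C][size][size]."""
    h = len(fmap[0])
    idx = [i * h // size for i in range(size)]
    return [[[plane[r][c] for c in idx] for r in idx] for plane in fmap]


def conv3x3(fmap, weight):
    """Zero-padded 3x3 convolution; weight has shape [C_out][C_in][3][3]."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = []
    for kernel in weight:
        plane = [[0.0] * w for _ in range(h)]
        for ci in range(c_in):
            for y in range(h):
                for x in range(w):
                    for dy in range(3):
                        for dx in range(3):
                            yy, xx = y + dy - 1, x + dx - 1
                            if 0 <= yy < h and 0 <= xx < w:
                                plane[y][x] += kernel[ci][dy][dx] * fmap[ci][yy][xx]
        out.append(plane)
    return out


def conv_lora_update(tokens, w_e, w_d, experts, gate_w, grid):
    """Low-rank update for N = grid * grid patch tokens (hypothetical helper).

    tokens:  [N][d] input tokens
    w_e:     [d][r] LoRA encoder (down-projection)
    w_d:     [r][d_out] LoRA decoder (up-projection)
    experts: list of (scale, conv_weight) pairs, one per expert
    gate_w:  [r][n_experts] gating weights (assumed linear gate)
    """
    r = len(w_e[0])
    z = matmul(tokens, w_e)  # bottleneck features, [N][r]
    # Reshape the token sequence into an [r][grid][grid] feature map.
    fmap = [[[z[y * grid + x][c] for x in range(grid)] for y in range(grid)]
            for c in range(r)]
    # Assumed gate: average-pool each channel, then score the experts.
    pooled = [[sum(map(sum, plane)) / (grid * grid) for plane in fmap]]
    logits = matmul(pooled, gate_w)[0]
    best = max(range(len(experts)), key=lambda i: logits[i])
    scale, weight = experts[best]
    # Expert: rescale, convolve, and return to the default scale.
    out = resize(conv3x3(resize(fmap, grid * scale), weight), grid)
    # Assumed residual: add the expert output to the bottleneck features.
    mixed = [[z[y * grid + x][c] + out[c][y][x] for c in range(r)]
             for y in range(grid) for x in range(grid)]
    return matmul(mixed, w_d)  # added to the frozen layer's output
```

Under these assumptions, setting every convolution weight to zero reduces the update to plain LoRA, `matmul(matmul(tokens, w_e), w_d)`, which is a convenient sanity check for any reimplementation.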

Paper Structure (19 sections, 6 equations, 14 figures, 15 tables)

Figures (14)

  • Figure 1: Comparison of VPT, LoRA, and Conv-LoRA (ours) in binary-class road segmentation (top) and multi-class transparent object segmentation (bottom). Conv-LoRA reinforces image-related local priors, allowing SAM to separate roads from adjacent buildings, while LoRA and VPT struggle in this regard. In the second row, VPT produces a reasonable mask for the bowls but erroneously assigns them to the jar/kettle class (indicated by object color), revealing SAM's limited high-level semantic understanding. Both LoRA and Conv-LoRA rectify this misclassification by fine-tuning SAM's image encoder, with Conv-LoRA delivering a cleaner mask with fewer boundary artifacts.
  • Figure 2: LoRA vs. Conv-LoRA. Both LoRA and Conv-LoRA add an extra trainable encoder-decoder structure parallel to the frozen pre-trained weights. Inside the bottleneck of LoRA, Conv-LoRA inserts lightweight convolution operations managed by MoE with negligible extra parameters.
  • Figure 3: MoE-Conv. It consists of $n$ experts and a gating network for dynamic expert selection. Each expert reconstructs feature maps at a specific scale, applies convolution, and returns the feature maps to the default scale. Each expert specializes in one unique feature scale.
  • Figure 4: SAM's mask decoder, modified for multi-class semantic segmentation. The classification module (within the red dashed box) is newly added relative to the original SAM mask decoder. $N$ is the number of output mask tokens, $K$ is the number of classes, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map (the batch dimension is omitted for simplicity).
  • Figure 5: Mean attention distance of each attention head, with each dot indicating the mean distance across images for one of the 16 heads at one layer. In contrast to MAE, SAM retains the ability to incorporate local information even in deeper layers.
  • ...and 9 more figures
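
The mean attention distance reported in Figure 5 can be illustrated with a short sketch. It uses one common definition (the attention-weighted average spatial distance between a query patch and every key patch, averaged over queries) and assumes that this is the definition behind the figure. The function name, the absence of a class token, and the 16-pixel patch size are assumptions made for illustration.

```python
import math


def mean_attention_distance(attn, grid, patch_size=16):
    """Mean attention distance, in pixels, for one head on one image.

    attn: [N][N] attention weights with N = grid * grid; each row sums to 1.
    grid: number of patches per side (a square patch grid is assumed).
    patch_size: pixels per patch side (assumed value, for illustration only).
    """
    n = grid * grid
    total = 0.0
    for q in range(n):
        qy, qx = divmod(q, grid)  # query patch coordinates
        for k in range(n):
            ky, kx = divmod(k, grid)  # key patch coordinates
            total += attn[q][k] * math.hypot(qy - ky, qx - kx) * patch_size
    return total / n  # average over query patches
```

A head that attends only to its own patch yields a distance of 0, while a head that attends uniformly yields the average patch-to-patch distance on the grid. Each dot in Figure 5 would then correspond to this quantity for one head at one layer, averaged across images.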