SAMCL: Empowering SAM to Continually Learn from Dynamic Domains with Extreme Storage Efficiency

Zeqing Wang, Kangye Ji, Di Wang, Haibin Zhang, Fei Cheng

TL;DR

SAMCL tackles SAM's open-world segmentation challenges with a modular continual learning framework. It assigns incremental knowledge to dedicated modules (AugModule) and uses a lightweight Module Selector, trained on compact embeddings instead of raw images, to pick the right module during inference. This greatly reduces storage compared to traditional replay-based methods. The approach achieves minimal forgetting (only 0.19%), gains of at least 2.5% on unseen domains, and substantial storage savings (up to 256× for the Module Selector's buffer), while remaining efficient to deploy. This work enables SAM to adapt continuously to dynamic domains while remaining practical for large-scale, multi-domain deployments.
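The module-per-domain idea can be illustrated with a minimal Python sketch of the control flow only. All names here (`ModularLearner`, `learn_domain`, `select`, `infer`) are hypothetical. Each "module" is a stand-in callable rather than an AugModule. The selector is a nearest-centroid rule swapped in for the trained MLP Module Selector. Adding a domain appends one module plus a few embeddings, and inference routes each input to exactly one module.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Embedding = List[float]
# Hypothetical stand-in for an AugModule: any callable carrying
# domain-specific knowledge. Here it just returns a label.
Module = Callable[[Embedding], str]


@dataclass
class ModularLearner:
    """Illustrative module set plus embedding buffer (not the paper's code)."""
    modules: List[Module] = field(default_factory=list)
    buffer: List[Tuple[Embedding, int]] = field(default_factory=list)

    def learn_domain(self, module: Module, embeddings: List[Embedding]) -> int:
        # A new domain gets its own module; earlier modules are never updated.
        domain_id = len(self.modules)
        self.modules.append(module)
        # Only a few compact embeddings per domain are kept for selection.
        self.buffer.extend((e, domain_id) for e in embeddings)
        return domain_id

    def select(self, embedding: Embedding) -> int:
        # Swapped-in selector: nearest class centroid over the buffer.
        # SAMCL instead trains a small MLP classifier on this buffer.
        if not self.modules:
            raise ValueError("no domains learned yet")
        best_id, best_dist = -1, float("inf")
        for domain_id in range(len(self.modules)):
            members = [e for e, d in self.buffer if d == domain_id]
            centroid = [sum(col) / len(members) for col in zip(*members)]
            dist = sum((a - b) ** 2 for a, b in zip(embedding, centroid))
            if dist < best_dist:
                best_id, best_dist = domain_id, dist
        return best_id

    def infer(self, embedding: Embedding) -> str:
        # Route the input to the single module chosen by the selector.
        return self.modules[self.select(embedding)](embedding)


learner = ModularLearner()
learner.learn_domain(lambda e: "module-A", [[0.9, 0.1], [1.0, 0.0]])
learner.learn_domain(lambda e: "module-B", [[0.0, 1.0], [0.1, 0.9]])
print(learner.infer([0.95, 0.05]))  # module-A
```

In this design, learning a new domain cannot overwrite earlier modules because they stay frozen. The remaining risk therefore shifts to selection errors, which is why selection accuracy matters.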

Abstract

Segment Anything Model (SAM) struggles in open-world scenarios with diverse domains. In such settings, naive fine-tuning with a well-designed learning module is inadequate and often causes catastrophic forgetting when learning incrementally. To address this issue, we propose a novel continual learning (CL) method for SAM, termed SAMCL. Rather than relying on a fixed learning module, our method decomposes incremental knowledge into separate modules and trains a selector to choose the appropriate one during inference. However, this intuitive design introduces two key challenges: ensuring effective module learning and selection, and managing storage as tasks accumulate. To tackle these, we introduce two components: AugModule and Module Selector. AugModule reduces the storage of the popular LoRA learning module by sharing parameters across layers while maintaining accuracy. It also employs heatmaps generated from point prompts to further enhance domain adaptation at minimal additional cost. Module Selector leverages the observation that SAM's embeddings can effectively distinguish domains, achieving high selection accuracy by training on storage-efficient embeddings instead of raw images. Experiments show that SAMCL outperforms state-of-the-art methods, with only 0.19% forgetting and at least a 2.5% gain on unseen domains. Each AugModule requires just 0.233 MB, reducing storage by at least 24.3% relative to other fine-tuning approaches. The buffer storage for Module Selector is further reduced by up to 256$\times$.
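Sharing parameters across layers saves storage, and a parameter count makes this concrete. The sketch below compares per-layer LoRA with a variant that shares one $A$ matrix across all layers, as Figure 2 describes for SLoRA. The layer count, dimensions, rank, and the placement of one LoRA per block are assumptions chosen to resemble a ViT-B encoder, not the paper's configuration. The resulting percentage is therefore not comparable to the reported 24.3%. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def lora_params(num_layers: int, d_in: int, d_out: int, rank: int) -> int:
    # Vanilla LoRA: each layer has its own A (rank x d_in) and B (d_out x rank).
    return num_layers * (rank * d_in + d_out * rank)


def slora_params(num_layers: int, d_in: int, d_out: int, rank: int) -> int:
    # Shared-A variant: one A for all layers plus a per-layer B.
    return rank * d_in + num_layers * d_out * rank


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def adapted_forward(w: Matrix, a_shared: Matrix, b_layer: Matrix, x: Vector) -> Vector:
    # y = W x + B_l (A x): frozen weight plus layer l's low-rank update,
    # where A is the same matrix for every layer.
    base = matvec(w, x)
    update = matvec(b_layer, matvec(a_shared, x))
    return [p + q for p, q in zip(base, update)]


# Assumed sizes: 12 blocks, 768 -> 768 projection, rank 4.
vanilla = lora_params(12, 768, 768, 4)
shared = slora_params(12, 768, 768, 4)
print(vanilla, shared, f"{1 - shared / vanilla:.1%} fewer parameters")
# 73728 39936 45.8% fewer parameters
```

Sharing $A$ removes all but one copy of the input-side projection. The savings therefore grow with the number of adapted layers, while each layer keeps its own $B$ to stay layer-specific.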

Paper Structure

This paper contains 39 sections, 7 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of SAMCL with SAM. During training, SAMCL uses a new AugModule to learn from each new domain, and all modules are stored in a module set. Meanwhile, Module Selector is trained on a few stored embeddings from the image encoder. During inference, SAMCL extracts latent embeddings from the image encoder, uses Module Selector to choose an appropriate module, and performs inference with SAM using the selected module. A detailed illustration of the inference process is provided in Appendix B.1.
  • Figure 2: Illustration of AugModule, integrating SLoRA and Prompt Augmentation. SLoRA shares matrix $A$ across all LoRAs for efficient adaptation. Prompt Augmentation converts point prompts into heatmaps, which are injected via a linear transformation for dimensional alignment.
  • Figure 3: Comparison among AsymmLoRA, vanilla LoRA, and SLoRA. Each method fine-tunes SAM’s image encoder on the CAMO, COD, ISTD, ISIC, and Kvasir datasets for 20 epochs. (a) shows results without Prompt Augmentation, and (b) shows results with Prompt Augmentation.
  • Figure 4: Overview of the Module Selector. We extract a fixed number ($N_e$) of embeddings from a specific block in the image encoder and reduce their dimensions from $(N_e, H, W, D)$ to $(N_e, D)$ by averaging over the $H$ and $W$ dimensions. These embeddings ($e_i \in \mathbb{R}^{D}$, where $i = 1, \dots, N_e$) are stored in a buffer aggregating embeddings from all learned domains. The Module Selector, a lightweight MLP with four linear layers, is trained using cross-entropy loss for 25 epochs on this embedding buffer. The layout of the four layers is detailed in Appendix B.2.
  • Figure 5: Selection accuracy across blocks in the image encoder of SAM (ViT-b). Each line denotes a different number of stored embeddings per dataset, all showing a similar trend. Results indicate high domain classification accuracy, with middle blocks performing best.
  • ...and 9 more figures
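The pooling and buffering step in Figure 4 can be sketched in plain Python. Each $(H, W, D)$ block output is averaged over $H$ and $W$ into a single $D$-dimensional vector. $N_e$ such vectors per domain are stored with the domain's module index, which is the label the selector is trained to predict. The function names are hypothetical. The 64×64×768 token-grid size in the storage comparison is an assumption based on SAM's ViT-B encoder at 1024×1024 input.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # one image's block output, H x W x D
Buffer = List[Tuple[List[float], int]]  # (pooled embedding, domain label)


def pool_embedding(fmap: FeatureMap) -> List[float]:
    """Average an (H, W, D) feature map over H and W, giving a (D,) vector."""
    h, w, d = len(fmap), len(fmap[0]), len(fmap[0][0])
    sums = [0.0] * d
    for row in fmap:
        for token in row:
            for k, value in enumerate(token):
                sums[k] += value
    return [s / (h * w) for s in sums]


def add_domain_to_buffer(buffer: Buffer, fmaps: List[FeatureMap],
                         domain_id: int, n_e: int) -> Buffer:
    """Store at most n_e pooled embeddings for one domain, labelled by module index."""
    for fmap in fmaps[:n_e]:
        buffer.append((pool_embedding(fmap), domain_id))
    return buffer


# Toy 2 x 2 x 3 feature map: the mean over the four tokens is [2, 2, 2].
toy = [[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
       [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]]
buf = add_domain_to_buffer([], [toy, toy, toy], domain_id=0, n_e=2)
print(buf)  # [([2.0, 2.0, 2.0], 0), ([2.0, 2.0, 2.0], 0)]

# Assumed ViT-B token grid: floats per full map vs. per pooled embedding.
full_map, pooled = 64 * 64 * 768, 768
print(full_map // pooled)  # 4096
```

Under that assumption, each stored embedding is 4096 times smaller than the full token map it came from. This shows where pooling saves space. It is not the derivation of the paper's 256$\times$ buffer reduction, which is not given here.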