SAMora: Enhancing SAM through Hierarchical Self-Supervised Pre-Training for Medical Images

Shuhang Chen, Hangjie Yuan, Pengwei Liu, Hanxue Gu, Tao Feng, Dong Ni

TL;DR

SAMora adapts SAM to medical image segmentation when labels are scarce. It uses a two-stage framework: Stage 1 pre-trains three LoRA experts with self-supervised objectives at the image, patch, and pixel levels, and Stage 2 fuses them with HL-Attn during prompt-free fine-tuning. The image-, patch-, and pixel-level experts use SimCLRv2, MAE, and a denoising autoencoder, respectively, and continual pre-training adapts the SimCLRv2 and MAE teacher models to the medical imaging domain. During fine-tuning, the SAM encoder and the LoRA experts stay frozen while the cross-attention-based HL-Attn module and the decoder are tuned. Key contributions include the hierarchical LoRA fusion (HL-Attn), compatibility with SAM variants (e.g., SAM2, SAMed, H-SAM), and state-of-the-art performance on Synapse, LA, and PROMISE12 in both few-shot and fully supervised settings, with 90% fewer fine-tuning epochs (notably with LoRA rank $r=4$). By leveraging abundant unlabeled data, the approach improves medical segmentation while remaining efficient, and the released code can be adopted in SAM-based pipelines.
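
For readers unfamiliar with LoRA, the $r=4$ above can be read against the standard low-rank update, sketched here under the assumption that $r$ is the LoRA rank. The symbols $W_0$, $A$, $B$, $\alpha$, $d$, and $k$ are not defined on this page and follow the usual LoRA convention; the dimension $d=k=768$ is purely illustrative and is not a value reported for SAMora.

$$
W = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
$$

Only $A$ and $B$ are trained while $W_0$ stays frozen, so the fraction of trainable parameters per adapted weight matrix is

$$
\frac{r(d+k)}{dk}\,\bigg|_{r=4,\; d=k=768} = \frac{6144}{589824} = \frac{1}{96} \approx 1.04\%.
$$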

Abstract

The Segment Anything Model (SAM) has demonstrated significant potential in medical image segmentation. Yet, its performance is limited when only a small amount of labeled data is available, while there is abundant valuable yet often overlooked hierarchical information in medical data. To address this limitation, we draw inspiration from self-supervised learning and propose SAMora, an innovative framework that captures hierarchical medical knowledge by applying complementary self-supervised learning objectives at the image, patch, and pixel levels. To fully exploit the complementarity of hierarchical knowledge within LoRAs, we introduce HL-Attn, a hierarchical fusion module that integrates multi-scale features while maintaining their distinct characteristics. SAMora is compatible with various SAM variants, including SAM2, SAMed, and H-SAM. Experimental results on the Synapse, LA, and PROMISE12 datasets demonstrate that SAMora outperforms existing SAM variants. It achieves state-of-the-art performance in both few-shot and fully supervised settings while reducing fine-tuning epochs by 90%. The code is available at https://github.com/ShChen233/SAMora.
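
The patch-level and pixel-level objectives named on this page (MAE and a denoising autoencoder) both corrupt the input and ask the model to reconstruct it. The sketch below illustrates only that corruption step in plain Python. It is not the authors' code: the function names, the 0.75 mask ratio (the default in the original MAE work), the Gaussian noise model, and `sigma` are all assumptions, and the image-level contrastive objective (SimCLRv2) is omitted.

```python
import random


def mask_patch_indices(num_patches, mask_ratio=0.75, seed=0):
    """MAE-style masking: choose a random subset of patch indices to hide.

    mask_ratio=0.75 is an assumed value, not one reported on this page.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return visible, masked


def add_gaussian_noise(pixels, sigma=0.1, seed=0):
    """Denoising-style corruption: the clean pixels are the reconstruction target.

    Gaussian noise and sigma=0.1 are assumptions made for illustration.
    """
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def mse(pred, target):
    """Mean squared error, a common reconstruction loss for both objectives."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)


# Hypothetical toy inputs: 16 patches and a 4-pixel intensity row.
visible, masked = mask_patch_indices(num_patches=16)
clean = [0.2, 0.5, 0.8, 1.0]
noisy = add_gaussian_noise(clean)
baseline_error = mse(noisy, clean)  # error of a model that just copies its input
```

In a real pipeline the encoder would see only the `visible` patches (or the `noisy` pixels), and the loss would compare its reconstruction against the hidden patches (or the `clean` pixels).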

Paper Structure

This paper contains 28 sections, 11 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: The Hierarchical Characteristics of Multi-Level Pre-training Tasks on Medical Images. The abundant hierarchical characteristics inherent in vast amounts of unlabeled data, when effectively fused, can significantly enhance the segmentation performance of SAM.
  • Figure 2: The Overview of SAMora. The training process of SAMora is divided into two stages. Stage 1 involves self-supervised pre-training using different LoRA experts across hierarchical levels. Each level employs a distinct self-supervised learning method: SimCLRv2 for the image level, MAE for the patch level, and denoising autoencoder for the pixel level. Continual Pre-Training (CPT) is applied to adapt the teacher models (SimCLRv2 and MAE) to the medical imaging domain. Stage 2 focuses on fine-tuning with labeled data, where the SAM encoder and LoRA experts remain frozen, and only the HL-Attn and Decoder components are tuned. The projector is a trainable dimension-alignment module.
  • Figure 3: The Structure of HL-Attn. Note that self-attention is not visualized in this figure.
  • Figure 4: The Performance of SAMora on Synapse Dataset.
  • Figure 5: The Visual Heatmaps between SAMed and SAMora. The heatmaps display regions of interest with varying levels of relevance, where red denotes areas of high attention, yellow indicates moderate attention, and blue represents low or no attention.
  • ...and 1 more figure
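
The Figure 2 caption states which components are frozen in Stage 2, and the TL;DR describes HL-Attn as cross-attention-based. The sketch below illustrates those two statements in plain Python and should not be read as the authors' HL-Attn. It assumes a single token, uses the base feature as the query and the three expert features as keys and values, and omits the learned query/key/value projections and the self-attention that the Figure 3 caption mentions. All names (`fuse_experts`, `STAGE2_TRAINABLE`, the dictionary keys) are hypothetical.

```python
import math

# Stage 2 trainability as described in the Figure 2 caption:
# the SAM encoder and LoRA experts are frozen; HL-Attn and the decoder are tuned.
STAGE2_TRAINABLE = {
    "sam_encoder": False,
    "lora_image": False,
    "lora_patch": False,
    "lora_pixel": False,
    "hl_attn": True,
    "decoder": True,
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_experts(query, expert_feats):
    """Scaled dot-product cross-attention for one token.

    query: feature vector acting as the attention query (assumed role).
    expert_feats: dict mapping expert name -> feature vector (keys and values).
    Returns the fused vector and the attention weight given to each expert.
    """
    scale = math.sqrt(len(query))
    names = list(expert_feats)
    weights = softmax([dot(query, expert_feats[n]) / scale for n in names])
    fused = [
        sum(w * expert_feats[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Hypothetical 4-dimensional features for the three hierarchical experts.
query = [1.0, 0.0, 0.0, 0.0]
experts = {
    "image": [2.0, 0.0, 0.0, 0.0],
    "patch": [0.0, 2.0, 0.0, 0.0],
    "pixel": [0.0, 0.0, 2.0, 0.0],
}
fused, weights = fuse_experts(query, experts)
```

Because the weights are a softmax over the experts, each expert's features remain a separate term in the fused sum, which is one simple way to combine levels without merging the LoRA weights themselves.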