CC-SAM: SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation

Shreyank N Gowda, David A. Clifton

TL;DR

CC-SAM enhances ultrasound image segmentation by fusing a frozen CNN backbone with SAM’s ViT encoder through a variational attention fusion module that models cross-modal uncertainty. Textual prompts generated by GPT-4 and embedded via MedBERT, together with Grounding-DINO bounding boxes, guide the prompt encoder to improve segmentation under challenging ultrasound conditions. The method achieves state-of-the-art Dice scores across seven ultrasound datasets and generalizes well to unseen data, while reducing computational cost through frozen backbones and adapters. This work demonstrates the practical potential of combining multimodal feature fusion, uncertainty-aware guidance, and language-driven prompts to adapt universal segmentation models to medical imaging tasks.
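Since the results are reported as Dice scores, a short reminder of how that metric is computed may help. The sketch below is only an illustration of the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), on binary masks. The function name `dice_score`, the nested-list mask format, and the convention of returning 1.0 for two empty masks are assumptions made for this example and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)

    # Assumed convention: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
score = dice_score(pred_mask, true_mask)
```

For the toy masks above, the overlap is 2 pixels out of 3 + 3, so the score is 4/6.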

Abstract

The Segment Anything Model (SAM) has achieved remarkable success in natural image segmentation, but its deployment in medical imaging has encountered challenges. Specifically, the model struggles with medical images that feature low contrast, faint boundaries, intricate morphologies, and small objects. To address these challenges and enhance SAM's performance in the medical domain, we introduce a comprehensive modification. First, we incorporate a frozen Convolutional Neural Network (CNN) branch as an image encoder, which works alongside SAM's original Vision Transformer (ViT) encoder through a novel variational attention fusion module. This integration strengthens the model's ability to capture local spatial information, which is often paramount in medical imagery. Moreover, to further optimize SAM for medical imaging, we introduce feature and position adapters within the ViT branch, refining the encoder's representations. We find that, compared with current prompting strategies for fine-tuning SAM on ultrasound segmentation, using text descriptions as prompts significantly improves performance. Leveraging ChatGPT's natural language understanding capabilities, we generate prompts that offer contextual information and guidance to SAM, enabling it to better understand the nuances of ultrasound medical images and improve its segmentation accuracy. Our method, in its entirety, represents a significant stride towards making universal image segmentation models more adaptable and efficient in the medical domain.

Paper Structure

This paper contains 28 sections, 9 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Comparison of methods using SAM for medical image segmentation.
  • Figure 2: Overview of CC-SAM. We use adapters (or, in the case of the CNN, an FC layer) to enhance local and global features for ultrasound segmentation. 'OPaE' refers to overlapping patch embeddings and 'PE' refers to positional embeddings.
  • Figure 3: Overview of the proposed Variational Attention Fusion (VAF) block. Each 'mode' has an intra-modal uncertainty learning encoder (represented as E' or E" in the figure) that obtains robust modality-specific features in the latent subspace. Subsequently, VAF combines these features and constructs a multimodal representation by estimating weights that are specific to each modality, effectively capturing their dependencies.
  • Figure 4: Qualitative comparison between our CC-SAM method and the state-of-the-art (SOTA) task-specific techniques.
  • Figure 5: Comparison of CC-SAM with task-specific techniques on seen datasets (highlighted in blue) and unseen datasets (indicated in orange). Higher orange bars indicate stronger generalization ability.
  • ...and 3 more figures
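
The Figure 3 caption describes two steps: per-modality encoders that produce uncertainty-aware latent features, and a fusion step that weights each modality. The toy sketch below illustrates that general idea only. Everything specific in it is an assumption for illustration and is not the paper's formulation: the Gaussian latent with a mean and log-variance, the reparameterized sampling, the rule that weights come from a softmax over negative average variance, and the hypothetical names `reparameterize`, `modality_weights`, and `fuse`. In an actual model the weights would be learned rather than computed by a fixed rule.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def modality_weights(log_vars):
    """Assumed rule: softmax over negative mean variance per modality,
    so the less uncertain modality receives the larger weight."""
    scores = [-sum(math.exp(lv) for lv in log_var) / len(log_var)
              for log_var in log_vars]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(latents, weights):
    """Weighted sum of the per-modality latent vectors."""
    dim = len(latents[0])
    return [sum(w * z[i] for w, z in zip(weights, latents))
            for i in range(dim)]


rng = random.Random(0)

# Hypothetical encoder outputs: (mean, log-variance) for each branch.
cnn_mean, cnn_log_var = [0.5, -1.0, 2.0], [-2.0, -2.0, -2.0]  # low uncertainty
vit_mean, vit_log_var = [1.0, 0.0, -1.0], [0.0, 0.0, 0.0]     # higher uncertainty

z_cnn = reparameterize(cnn_mean, cnn_log_var, rng)
z_vit = reparameterize(vit_mean, vit_log_var, rng)

weights = modality_weights([cnn_log_var, vit_log_var])
fused = fuse([z_cnn, z_vit], weights)
```

With these toy inputs the CNN branch has the smaller variance, so it receives the larger of the two weights, and the weights sum to one.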