CLISC: Bridging CLIP and SAM by Enhanced CAM for Unsupervised Brain Tumor Segmentation

Xiaochuan Ma, Jia Fu, Wenjun Liao, Shichuan Zhang, Guotai Wang

TL;DR

This work tackles the challenge of unsupervised brain tumor segmentation by leveraging foundation models. It introduces CLISC, a framework that uses CLIP-derived image-level labels to train a classifier and generate Class Activation Maps (CAMs), enhances the CAMs with Adaptive Masking-based Data Augmentation (AMDA), and uses CAM-derived prompts to drive the Segment Anything Model (SAM) to produce segmentation pseudo-labels. A 3D U-Net is then trained in a self-training loop that filters out low-quality pseudo-labels via SAM-Seg Similarity Filtering (S3F), achieving an average Dice score of $85.60\%$ and a 95th-percentile Hausdorff distance ($HD_{95}$) of $6.72$ mm on BraTS2020, outperforming several unsupervised baselines and approaching fully supervised performance. The method reduces annotation costs and demonstrates the potential of combining CLIP and SAM for robust medical image segmentation, with future work extending to tumor substructures and other organs.
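The filtering idea behind S3F can be illustrated with a minimal sketch. It assumes that similarity is measured with the Dice score between a SAM pseudo-label and the network's prediction, and that a case is kept when the score reaches a cutoff. The function names and the `threshold` value are hypothetical and are not taken from the paper.

```python
import numpy as np


def dice_score(a, b, eps=1e-8):
    """Dice overlap between two binary masks of the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    # eps keeps the ratio defined when both masks are empty.
    return float((2.0 * intersection + eps) / (a.sum() + b.sum() + eps))


def filter_pseudo_labels(sam_masks, net_preds, threshold=0.5):
    """Return indices of cases whose SAM mask agrees with the network.

    `threshold` is an illustrative cutoff (an assumption), not the
    value used by the authors.
    """
    kept = []
    for index, (sam_mask, net_pred) in enumerate(zip(sam_masks, net_preds)):
        if dice_score(sam_mask, net_pred) >= threshold:
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Toy 3D volumes: one identical pair and one barely overlapping pair.
    sam = np.zeros((4, 4, 4), dtype=bool)
    sam[1:3, 1:3, 1:3] = True
    shifted = np.zeros((4, 4, 4), dtype=bool)
    shifted[2:4, 2:4, 2:4] = True
    print(filter_pseudo_labels([sam, sam], [sam, shifted]))
```

In a self-training loop of this kind, only the kept cases would contribute pseudo-labels to the next training round.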

Abstract

Brain tumor segmentation is important for tumor diagnosis, but current deep-learning methods rely on large sets of annotated images for training, which incurs high annotation costs. Unsupervised segmentation promises to avoid human annotation, but its performance is often limited. In this study, we present a novel unsupervised segmentation approach that leverages the capabilities of foundation models. It consists of three main steps: (1) A vision-language model (i.e., CLIP) is employed to obtain image-level pseudo-labels for training a classification network. Class Activation Mapping (CAM) is then employed to extract Regions of Interest (ROIs), with an adaptive masking-based data augmentation used to enhance ROI identification. (2) The ROIs are used to generate bounding-box and point prompts for the Segment Anything Model (SAM) to obtain segmentation pseudo-labels. (3) A 3D segmentation network is trained with the SAM-derived pseudo-labels, and low-quality pseudo-labels are filtered out in a self-learning process based on the similarity between SAM's output and the network's prediction. Evaluation on the BraTS2020 dataset shows that our approach obtains an average Dice Similarity Score (DSC) of 85.60%, outperforming five state-of-the-art unsupervised segmentation methods by more than 10 percentage points. Our approach also outperforms directly using SAM for zero-shot inference, and its performance is close to that of fully supervised learning.
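Step (2) can be sketched as follows, under stated assumptions: the CAM is a 2D activation map, the ROI is the region at or above a relative activation cutoff, the box prompt is the tight bounding box of that region, and the point prompt is the location of the peak activation. The function name `cam_to_prompts` and the `rel_threshold` value are hypothetical; the paper's exact ROI extraction may differ.

```python
import numpy as np


def cam_to_prompts(cam, rel_threshold=0.5):
    """Derive a box prompt and a point prompt from a 2D CAM.

    Returns ((x_min, y_min, x_max, y_max), (x, y)) in pixel
    coordinates, or None when the map is flat and has no ROI.
    `rel_threshold` is an illustrative cutoff (an assumption).
    """
    cam = np.asarray(cam, dtype=float)
    low, high = cam.min(), cam.max()
    if high <= low:
        return None  # constant map: nothing to localize

    # Min-max normalize so the cutoff is relative to the peak.
    normalized = (cam - low) / (high - low)
    rows, cols = np.nonzero(normalized >= rel_threshold)

    # Tight bounding box around all above-threshold pixels.
    box = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))

    # Point prompt at the strongest activation.
    peak_row, peak_col = np.unravel_index(np.argmax(cam), cam.shape)
    return box, (int(peak_col), int(peak_row))


if __name__ == "__main__":
    toy_cam = np.zeros((8, 8))
    toy_cam[2:5, 3:7] = 0.8  # activated region
    toy_cam[3, 4] = 1.0      # peak activation
    print(cam_to_prompts(toy_cam))
```

The box is returned as corner coordinates and the point in (x, y) order so that both can be passed to a prompt-based segmenter without reordering. Handling several disconnected activated regions, for example one box per connected component, is left out of this sketch.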

Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our proposed CLISC framework for unsupervised brain tumor segmentation.
  • Figure 2: Visual comparison of different unsupervised methods on brain tumor with small, medium, and large sizes. The red curves are contours of the ground truths.
  • Figure 3: Effect of hyperparameters for AMDA and S3F.