MedicoSAM: Robust Improvement of SAM for Medical Imaging

Anwai Archit, Luca Freckmann, Constantin Pape

TL;DR

This work systematically evaluates how finetuning the Segment Anything Model (SAM) on a large medical dataset affects interactive and semantic segmentation across 2D and 3D modalities. It introduces MedicoSAM, a finetuned model trained with a full iterative objective that balances box and mask prompts, which robustly improves interactive segmentation while preserving compatibility with annotation tools. The study also explores domain-specific pretraining for semantic segmentation and finds little to no benefit, with results that sometimes lag behind strong 3D baselines like nnU-Net. MedicoSAM is publicly released, and the paper highlights practical pathways for adapting foundation models to medical imaging while stressing the importance of tool interoperability for real-world annotation workflows.

Abstract

Medical image segmentation is an important analysis task in clinical practice and research. Deep learning has massively advanced the field, but current approaches are mostly based on models trained for a specific task. Training such models or adapting them to a new condition is costly due to the need for (manually) labeled data. The emergence of vision foundation models, especially Segment Anything, offers a path to universal segmentation for medical images, overcoming these issues. Here, we study how to improve Segment Anything for medical images by comparing different finetuning strategies on a large and diverse dataset. We evaluate the finetuned models on a wide range of interactive and (automatic) semantic segmentation tasks. We find that the performance can be clearly improved for interactive segmentation. However, semantic segmentation does not benefit from pretraining on medical images. Our best model, MedicoSAM, is publicly available at https://github.com/computational-cell-analytics/medico-sam. We show that it is compatible with existing tools for data annotation and believe that it will be of great practical value.

Paper Structure (12 sections, 9 figures, 1 table, 1 algorithm)

This paper contains 12 sections, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: a) Contribution overview: We finetune SAM on a large medical dataset to build MedicoSAM. We evaluate it for interactive and semantic segmentation. The latter requires training on additional annotated data (that could be generated via interactive segmentation), for 2D and 3D data. b) Results for interactive 2D segmentation, comparing MedicoSAM and other models derived from SAM. We report the average over 16 datasets for segmentation with a point (green) or box (yellow) prompt and segmentation after iterative correction starting from a point (dark green) or box (dark purple). c) Results for interactive 3D segmentation. We report the average over 6 different datasets for segmentation based on a single point or box. d, e) Results for semantic 2D and 3D segmentation. We report the average over 6 datasets in both cases. The three best methods are highlighted in shades of blue, with darker shades indicating better results; all other methods are shown in gray.
  • Figure 2: a) The SAM architecture for interactive segmentation consists of an image encoder, a prompt encoder (split into a part for mask prompts and a part for point/box prompts), and a mask decoder. For 3D interactive segmentation, we propagate prompts across the depth axis. In addition, we add a convolutional decoder for automated segmentation (orange). This decoder is pre-trained with a binary segmentation task (blue masks), jointly with training for interactive segmentation.
  • Figure 3: a) Overall results for interactive 2D segmentation. We report the Dice coefficient for simulated interactive segmentation. Each bar corresponds to the result of a correction iteration, starting either from a point (green) or a box (yellow) prompt. The result after correction is highlighted in dark green / dark purple. We compare 10 different models. Models trained by us are marked with a * and the model trained on the entire dataset is marked in bold font. The same model notation is used in all figures. b) Interactive segmentation results for 16 individual datasets. We report the absolute difference of the Dice coefficient compared to the original SAM and report only the results for the initial and final segmentation.
  • Figure 4: Results for interactive 3D segmentation for 6 different datasets. We report the difference in Dice score compared to SAM for four other models. Segmentations are derived from a single point (green) or box (yellow) prompt placed in the central slice for each object in the respective dataset. We use the implementation of [archit2023segment] for methods using SAM and determine the best method for prompt propagation on a separate validation set; see also the Methods section on interactive 3D segmentation. SAM2 supports 3D segmentation by default.
  • Figure 5: a) Qualitative results for interactive 2D segmentation. We compare interactive segmentation based on a single point or single box prompt (cyan) with SAM, MedSAM and MedicoSAM for nine different datasets. For each image, we show prompts with a large improvement of MedicoSAM over SAM and the corresponding MedSAM result. b) Outputs of the image encoder from the three models on different datasets, additionally including abdominal CT [curvas] and brain MRI [pedims], visualized by their three main PCA components. MedSAM and MedicoSAM seem to learn a more discriminative representation with a clearer distinction of the background.
  • ...and 4 more figures
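
Several of the captions above report the Dice coefficient, and Figures 3 and 4 report its difference relative to the original SAM. The following is a minimal sketch of how such a score and difference could be computed for binary masks. The function name `dice_coefficient`, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions made for illustration; they are not taken from the paper or from the MedicoSAM code, and the numbers are not results from the paper.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / total)


# Toy ground truth: a 4x4 square (16 pixels) in an 8x8 image.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True

# Hypothetical prediction of a baseline model: covers half of the object.
baseline_pred = np.zeros((8, 8), dtype=bool)
baseline_pred[2:6, 2:4] = True

# Hypothetical prediction of a finetuned model: covers three quarters.
finetuned_pred = np.zeros((8, 8), dtype=bool)
finetuned_pred[2:6, 2:5] = True

baseline_dice = dice_coefficient(baseline_pred, target)
finetuned_dice = dice_coefficient(finetuned_pred, target)

# Difference relative to the baseline, as plotted per dataset in the figures.
dice_difference = finetuned_dice - baseline_dice

print(f"baseline:   {baseline_dice:.3f}")
print(f"finetuned:  {finetuned_dice:.3f}")
print(f"difference: {dice_difference:+.3f}")
```

A positive difference means the compared model overlaps the ground truth better than the baseline on that example; in practice the score would be averaged over all objects in a dataset before taking the difference.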