Generative Medical Segmentation

Jiayu Huo; Xi Ouyang; Sébastien Ourselin; Rachel Sparks

TL;DR

Generative Medical Segmentation (GMS) addresses the limited generalization of medical image segmentation models. A frozen pre-trained vision foundation model encodes images and masks into latent spaces, a lightweight latent mapping model translates image latents into mask latents, and the frozen decoder maps the predicted mask latents back to pixel space. This design reduces the number of trainable parameters and improves cross-domain performance across five public datasets spanning ultrasound, histology, dermoscopy, and endoscopy. Experiments show that GMS outperforms discriminative and several generative baselines and generalizes well across domains, particularly on cross-center ultrasound data. Ablation studies highlight the importance of both latent- and image-space supervision and of the SD-VAE tokenizer. The work suggests that latent-space generative segmentation, driven by foundation-model representations, is a scalable and effective direction for medical image segmentation; the authors plan to extend it to 3D data.

Abstract

Rapid advancements in medical image segmentation performance have been significantly driven by the development of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models follow the discriminative pixel-wise classification learning paradigm and often have limited ability to generalize across diverse medical imaging datasets. In this manuscript, we introduce Generative Medical Segmentation (GMS), a novel approach leveraging a generative model to perform image segmentation. Concretely, GMS employs a robust pre-trained vision foundation model to extract latent representations for images and corresponding ground truth masks, followed by a model that learns a mapping function from the image to the mask in the latent space. Once trained, the model generates an estimated segmentation mask by using the pre-trained vision foundation model to decode the predicted latent representation back into the image space. The design of GMS leads to fewer trainable parameters in the model, which reduces the risk of overfitting and enhances its generalization capability. Our experimental analysis across five public datasets in different medical imaging domains demonstrates that GMS outperforms existing discriminative and generative segmentation models. Furthermore, GMS generalizes well across datasets from different centers within the same imaging modality. Our experiments suggest that GMS offers a scalable and effective solution for medical image segmentation. The GMS implementation and trained model weights are available at https://github.com/King-HAW/GMS.
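The data flow described in the abstract (frozen encoder, trainable latent mapping, frozen decoder) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `frozen_encode`, `frozen_decode`, and `latent_mapping` are hypothetical stand-ins, the 8x spatial reduction and 4 latent channels are assumed here as the usual Stable Diffusion VAE configuration, and the 0.5 threshold is an illustrative choice.

```python
import numpy as np

LATENT_DOWNSAMPLE = 8  # assumption: spatial reduction factor of the tokenizer
LATENT_CHANNELS = 4    # assumption: number of latent channels


def frozen_encode(image):
    """Stand-in for the frozen encoder: (3, H, W) -> (4, H/8, W/8)."""
    c, h, w = image.shape
    f = LATENT_DOWNSAMPLE
    # Average-pool each f x f patch; a real encoder is a learned network.
    pooled = image.reshape(c, h // f, f, w // f, f).mean(axis=(2, 4))
    extra = pooled.mean(axis=0, keepdims=True)  # fourth toy latent channel
    return np.concatenate([pooled, extra], axis=0)


def frozen_decode(latent):
    """Stand-in for the frozen decoder: (4, h, w) -> (8h, 8w)."""
    f = LATENT_DOWNSAMPLE
    merged = latent.mean(axis=0)
    # Nearest-neighbour upsampling back to pixel resolution.
    return np.repeat(np.repeat(merged, f, axis=0), f, axis=1)


def latent_mapping(z_image, weight):
    """The only trainable part in this sketch: a per-pixel channel mixing."""
    return np.einsum("oc,chw->ohw", weight, z_image)


def segment(image, weight, threshold=0.5):
    z_image = frozen_encode(image)             # image latent (no gradients)
    z_mask = latent_mapping(z_image, weight)   # predicted mask latent
    decoded = frozen_decode(z_mask)            # back to pixel space
    return decoded > threshold                 # binary mask


rng = np.random.default_rng(0)
image = rng.random((3, 224, 224))
weight = np.eye(LATENT_CHANNELS)  # untrained placeholder weights
mask = segment(image, weight)
print(mask.shape)  # (224, 224)
```

In this sketch only `weight` would be updated during training, which mirrors the abstract's point that freezing the encoder and decoder keeps the number of trainable parameters small.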

Paper Structure (19 sections, 5 equations, 2 figures, 5 tables)

Introduction
Related Works
Medical Image Segmentation
Generative & Foundation Models
Methodology
Architecture Overview
Image Tokenizer
Latent Mapping Model (LMM)
Loss Functions
Experiments
Datasets
Implementation Details
Comparison with State-of-the-Art Models
Domain Generalization Ability
Qualitative Segmentation Results
...and 4 more sections

Figures (2)

Figure 1: GMS network architecture for 2D medical image segmentation. $\mathcal{E}$ and $\mathcal{D}$ denote the encoder and decoder of a pre-trained vision foundation model, whose weights are frozen. We use the model weights from the Stable Diffusion VAE for $\mathcal{E}$ and $\mathcal{D}$. The latent mapping model (orange box) contains convolution blocks and self-attention blocks but no down-sampling layers, a design that helps preserve the spatial information in the input feature vectors. Here, Conv denotes the 2D convolution operation, and GN denotes Group Normalization.
Figure 2: Exemplar segmentation results of GMS and other state-of-the-art methods. From top to bottom are images from the BUS, BUSI, GlaS, HAM10000 and Kvasir-Instrument datasets. The green contours are the ground truth, and the yellow contours are the model predictions. Zoom in for more details.
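The Figure 1 caption notes that the latent mapping model combines convolution and self-attention blocks without any down-sampling. A minimal NumPy sketch of one such block follows, under stated assumptions: the 1x1 convolution, the residual connections, the GN-before-layer ordering, and all names (`mapping_block`, `params`) are illustrative choices, not details taken from the paper.

```python
import numpy as np


def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization over a (C, H, W) feature map, without affine terms."""
    c, h, w = x.shape
    g = x.reshape(num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)


def conv1x1(x, weight):
    """2D convolution with kernel size 1 (a simplification): channel mixing."""
    return np.einsum("oc,chw->ohw", weight, x)


def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the H*W spatial positions."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T                 # (H*W, C)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over positions
    return (attn @ v).T.reshape(c, h, w)


def mapping_block(x, params):
    """Conv block then attention block; no stride or pooling anywhere."""
    groups = params["groups"]
    y = x + conv1x1(group_norm(x, groups), params["conv"])
    normed = group_norm(y, groups)
    return y + self_attention(normed, params["wq"], params["wk"], params["wv"])


rng = np.random.default_rng(0)
channels, height, width = 8, 28, 28
params = {"groups": 4}
for name in ("conv", "wq", "wk", "wv"):
    params[name] = 0.1 * rng.standard_normal((channels, channels))

x = rng.standard_normal((channels, height, width))
out = mapping_block(x, params)
print(out.shape)  # (8, 28, 28)
```

Because no layer changes the spatial size, the output has the same height and width as the input latent, which is the property the caption attributes to omitting down-sampling layers.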
