Panoptic Segmentation of Mammograms with Text-To-Image Diffusion Model

Kun Zhao, Jakub Prokop, Javier Montalt Tordera, Sadegh Mohammadi

TL;DR

The paper addresses the need for unified semantic and instance segmentation of breast lesions in mammography by proposing M-ODISE, a panoptic segmentation framework that leverages a mammography-tuned text-to-image diffusion model (MAM-E) together with BiomedCLIP-based implicit captioning and Mask2Former to delineate lesions. It adapts open-vocabulary segmentation concepts to the medical domain and evaluates on CDD-CESM and VinDr-Mammo, showing that diffusion-based panoptic methods can outperform baselines in several metrics, though gains are dataset-dependent. The results highlight both the potential and limitations of applying panoptic diffusion-based segmentation to mammography, pointing to the importance of dataset quality, annotation consistency, and domain-specific encoders. This work paves the way for open-vocabulary, diffusion-guided lesion segmentation in mammography and motivates the development of larger, standardized datasets for clinical deployment.

Abstract

Mammography is crucial for breast cancer surveillance and early diagnosis. However, analyzing mammography images is a demanding task for radiologists, who often review hundreds of mammograms daily, leading to overdiagnosis and overtreatment. Computer-Aided Diagnosis (CAD) systems have been developed to assist in this process, but their capabilities, particularly in lesion segmentation, remain limited. Contemporary advances in deep learning may improve their performance. Recently, vision-language diffusion models have emerged, demonstrating outstanding performance in image generation and transferability to various downstream tasks. We aim to harness their capabilities for breast lesion segmentation in a panoptic setting, which encompasses both semantic and instance-level predictions. Specifically, we propose using pretrained features from a Stable Diffusion model as inputs to a state-of-the-art panoptic segmentation architecture to delineate individual breast lesions accurately. To bridge the gap between the natural and medical imaging domains, we incorporated a mammography-specific MAM-E diffusion model and BiomedCLIP image and text encoders into this framework. We evaluated our approach on two recently published mammography datasets, CDD-CESM and VinDr-Mammo. For the instance segmentation task, we obtained 40.25 AP0.1 and 46.82 AP0.05, as well as 25.44 PQ0.1 and 26.92 PQ0.05. For the semantic segmentation task, we achieved Dice scores of 38.86 and 40.92, respectively.
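To make the reported metrics concrete, the following is a minimal Python sketch of the Dice coefficient and panoptic quality (PQ). It is not the authors' evaluation code. It assumes that the subscripts in PQ0.1 and PQ0.05 denote the IoU threshold used to match predicted and ground-truth lesions, and it uses a greedy one-to-one matching because thresholds below 0.5 do not guarantee unique matches on their own. The helper `rect` and the toy masks are hypothetical, and the functions return values in [0, 1].

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # a binary mask stored as the set of its foreground pixels


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|); defined as 1.0 if both are empty."""
    total = len(pred) + len(target)
    if total == 0:
        return 1.0
    return 2.0 * len(pred & target) / total


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two masks; 0.0 if both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(preds: List[Mask], targets: List[Mask], thr: float) -> float:
    """PQ = (sum of IoU over matched pairs) / (TP + 0.5 * FP + 0.5 * FN).

    Assumption: pairs are matched greedily in order of decreasing IoU,
    and a pair counts as a match only if its IoU is strictly above `thr`.
    """
    pairs = sorted(
        ((iou(p, t), i, j) for i, p in enumerate(preds) for j, t in enumerate(targets)),
        reverse=True,
    )
    used_preds, used_targets, iou_sum = set(), set(), 0.0
    for score, i, j in pairs:
        if score <= thr:
            break  # all remaining pairs are below the threshold
        if i in used_preds or j in used_targets:
            continue  # keep the matching one-to-one
        used_preds.add(i)
        used_targets.add(j)
        iou_sum += score
    tp = len(used_preds)
    fp = len(preds) - tp    # predictions without a matched ground truth
    fn = len(targets) - tp  # ground-truth lesions that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


def rect(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Hypothetical helper: a rectangular toy mask covering rows r0..r1-1, cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy example: two ground-truth lesions, one partial hit and one false positive.
gt = [rect(0, 0, 4, 4), rect(10, 10, 14, 14)]
pred = [rect(0, 0, 4, 2), rect(20, 20, 22, 22)]

pq_loose = panoptic_quality(pred, gt, thr=0.1)
pq_strict = panoptic_quality(pred, gt, thr=0.5)
semantic_dice = dice(set().union(*pred), set().union(*gt))
```

In this toy case the single overlapping pair has an IoU of exactly 0.5, so it counts as a match at a threshold of 0.1 but not at 0.5. This illustrates why PQ and AP values depend strongly on the chosen threshold.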

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The overview of our framework, adapted from ODISE. Features extracted by the text-to-image diffusion model are passed to a mask generator, which outputs binary mask predictions and mask embeddings for individual objects detected in the image. These mask embeddings are then combined with category embeddings from the text encoder via a dot product to supervise the classification task. Additionally, an implicit captioner encodes the image to provide a conditioning signal for the diffusion process.
  • Figure 2: Qualitative visualization of M-ODISE predictions on the CDD-CESM dataset. For more visual examples, please refer to the supplementary material.
  • Figure S1: Model performance across varying IoU thresholds. At a threshold of 0.95, some cases had no matched ground truth and are therefore excluded from the analysis.
  • Figure S2: Qualitative visualization of M-ODISE predictions on the CDD-CESM dataset. The ground truth mask, panoptic segmentation mask, instance segmentation mask, and semantic heatmap are presented for four samples.
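
The classification step described in the Figure 1 caption, where mask embeddings are combined with category embeddings via a dot product, can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the L2 normalization, the softmax with a `temperature` parameter, the category names, and the three-dimensional toy embeddings are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumption: cosine-style similarity)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_masks(
    mask_embeddings: Sequence[Sequence[float]],
    category_embeddings: Sequence[Sequence[float]],
    temperature: float = 0.1,  # hypothetical value, not taken from the paper
) -> List[List[float]]:
    """Return one probability distribution over categories per mask.

    Each logit is the dot product between a mask embedding and a
    category (text) embedding, divided by the temperature.
    """
    cats = [l2_normalize(c) for c in category_embeddings]
    probs = []
    for emb in mask_embeddings:
        emb = l2_normalize(emb)
        logits = [sum(a * b for a, b in zip(emb, c)) / temperature for c in cats]
        probs.append(softmax(logits))
    return probs


# Hypothetical category names and toy embeddings, for illustration only.
categories = ["mass", "calcification", "background"]
category_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]]

probs = classify_masks(mask_embeddings, category_embeddings)
labels = [categories[max(range(len(p)), key=p.__getitem__)] for p in probs]
```

Because the category embeddings come from a text encoder, the set of categories can in principle be changed at inference time by encoding different names, which is the idea behind the open-vocabulary setting mentioned in the TL;DR.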