
Enhancing Label-efficient Medical Image Segmentation with Text-guided Diffusion Models

Chun-Mei Feng

TL;DR

This paper tackles the labeling bottleneck in medical image segmentation by augmenting diffusion-based representations with inexpensive medical text annotations. It introduces TextDiff, a dual-branch framework that freezes the image diffusion encoder and a clinical text encoder while training a cross-modal attention module and a pixel classifier to produce segmentation maps. The approach aligns diffusion latent activations with diagnostic text, achieving state-of-the-art Dice and IoU on MoNuSeg and QaTa-COVID19 with only a few labeled examples and a small parameter footprint (~9.68M). This work highlights the practical potential of text-guided diffusion for label-efficient medical image analysis and paves the way for text-informed diffusion models in clinical workflows.
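For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou`, the nested-list mask format, and the empty-mask convention are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks (nested lists of 0/1)."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks (hypothetical data): 2 overlapping pixels, 3 foreground pixels each.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
```

On the toy masks, the intersection is 2 and the union is 4, so Dice is 2·2/(3+3) and IoU is 2/4. Dice is never lower than IoU for the same pair of masks, which is why papers usually report both.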

Abstract

Aside from offering state-of-the-art performance in medical image generation, denoising diffusion probabilistic models (DPM) can also serve as a representation learner to capture semantic information and potentially be used as an image representation for downstream tasks, e.g., segmentation. However, these latent semantic representations rely heavily on labor-intensive pixel-level annotations as supervision, limiting the usability of DPM in medical image segmentation. To address this limitation, we propose an enhanced diffusion segmentation model, called TextDiff, that improves semantic representation through inexpensive medical text annotations, thereby explicitly establishing semantic representation and language correspondence for diffusion models. Concretely, TextDiff extracts intermediate activations of the Markov step of the reverse diffusion process in a pretrained diffusion model on large-scale natural images and learns additional expert knowledge by combining them with complementary and readily available diagnostic text information. TextDiff freezes the dual-branch multi-modal structure and mines the latent alignment of semantic features in diffusion models with diagnostic descriptions by only training the cross-attention mechanism and pixel classifier, making it possible to enhance semantic representation with inexpensive text. Extensive experiments on public QaTa-COVID19 and MoNuSeg datasets show that our TextDiff is significantly superior to the state-of-the-art multi-modal segmentation methods with only a few training samples.
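The cross-attention step described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes a single attention head, no learned projection matrices, and a residual connection; per-pixel image features act as queries and text-token embeddings act as keys and values. The names `softmax` and `cross_attention` and the list-of-vectors layout are hypothetical, and the paper's actual multi-scale module is not specified in this section.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(pixel_feats, text_feats):
    """Scaled dot-product cross-attention (illustrative sketch only).

    pixel_feats: list of N vectors, used as queries (e.g. diffusion activations).
    text_feats:  list of T vectors of the same dimension, used as keys and values.
    Returns N vectors: each pixel feature plus its attended text summary.
    """
    dim = len(pixel_feats[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in pixel_feats:
        # Similarity of this pixel to every text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_feats]
        weights = softmax(scores)
        # Weighted average of the text-token vectors.
        attended = [sum(w * v[d] for w, v in zip(weights, text_feats))
                    for d in range(dim)]
        # Residual connection (an assumption of this sketch).
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


# Hypothetical toy input: two pixels, two text tokens, feature dimension 2.
fused = cross_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a setup like the one the abstract describes, the image and text encoders that produce `pixel_feats` and `text_feats` would stay frozen, and only the attention module's parameters (omitted here) and the pixel classifier would be updated during training.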

Paper Structure (8 sections, 3 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of the proposed TextDiff framework. The Image Encoder is based on a pre-trained diffusion model [dhariwal2021diffusion] and produces high-level semantic information, while Clinical BioBERT [alsentzer2019publicly] serves as the Text Encoder. Multi-scale Cross-modal Attention aligns the knowledge in the diagnostic text annotations with the images to enhance semantic representations.
  • Figure 2: Visual segmentation comparisons on the different datasets; a detailed analysis is provided in Sec. \ref{sota}.
  • Figure 3: Evolution of the segmentation performance across different blocks and steps of our proposed method on the two datasets; see Sec. \ref{ab} for more details.