Language Guided Domain Generalized Medical Image Segmentation

Shahina Kunhimon, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

TL;DR

This paper tackles single source domain generalization (SDG) for medical image segmentation under domain shift. It introduces a text-guided contrastive feature alignment (TGCFA) module that couples a segmentation network with a frozen CLIP text encoder, using ChatGPT-generated class descriptions to ground visual features in linguistic context. The training objective adds a feature-level alignment loss to the segmentation loss, $L = L_{Seg} + L_{Align}$, which aligns image features with the text embeddings. Experiments across cross-modality, cross-sequence, and cross-site settings report consistent gains over strong baselines, with better boundary delineation and robustness. Code and model weights are publicly available.
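To make the combined objective $L = L_{Seg} + L_{Align}$ concrete, here is a minimal NumPy sketch. The summary above does not spell out the form of $L_{Align}$, so the InfoNCE-style cross-entropy over cosine similarities, the `temperature` value, and the function names `alignment_loss` and `total_loss` are all assumptions made for illustration, not the authors' implementation. Row `i` of `image_feats` is assumed to be the visual feature for class `i`, and row `i` of `text_embeds` the frozen text embedding for the same class.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def alignment_loss(image_feats, text_embeds, temperature=0.07):
    """Assumed InfoNCE-style loss: image row i should match text row i.

    image_feats: (C, D) one visual feature vector per class
    text_embeds: (C, D) one text embedding per class (treated as constants)
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_embeds)
    logits = img @ txt.T / temperature                    # (C, C) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on matching pairs


def total_loss(seg_loss, image_feats, text_embeds):
    """L = L_Seg + L_Align, with seg_loss computed elsewhere (e.g. Dice or cross-entropy)."""
    return seg_loss + alignment_loss(image_feats, text_embeds)
```

In this sketch the loss is small when each class's visual feature is closest to its own text embedding and grows when features drift toward another class's description, which is the behavior a contrastive alignment term is meant to encourage.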

Abstract

Single source domain generalization (SDG) holds promise for more reliable and consistent image segmentation across real-world clinical settings, particularly in the medical domain, where data privacy and acquisition cost constraints often limit the availability of diverse datasets. Depending solely on visual features hampers the model's capacity to adapt effectively to various domains, primarily because of the presence of spurious correlations and domain-specific characteristics embedded within the image features. Incorporating text features alongside visual features is a potential solution to enhance the model's understanding of the data, as it goes beyond pixel-level information to provide valuable context. Textual cues describing the anatomical structures, their appearances, and variations across various imaging modalities can guide the model in domain adaptation, ultimately contributing to more robust and consistent segmentation. In this paper, we propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features to learn a more robust feature representation. We assess the effectiveness of our text-guided contrastive feature alignment technique in various scenarios, including cross-modality, cross-sequence, and cross-site settings for different segmentation tasks. Our approach achieves favorable performance against existing methods in the literature. Our code and model weights are available at https://github.com/ShahinaKK/LG_SDG.git.

Paper Structure (7 sections, 5 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: The proposed training pipeline consists of (i) a segmentation network: an encoder-decoder model; (ii) a CLIP text encoder, which is frozen and takes text descriptions from ChatGPT as input to create label-wise text embeddings; and (iii) a Text-Guided Contrastive Feature Alignment module, which enhances the alignment between the image and text encoder representations via our feature-level contrastive loss.
  • Figure 2: Qualitative results (Cross-Site Fundus): CCSDG struggles to accurately define label boundaries (red dashed box), while our approach enhances boundary definition.
  • Figure 3: Qualitative results (Cross-Modality Abdomen): Domain-specific appearance shift: the liver (blue dashed box) appears dark in MRI (top) and bright in CT (bottom) images. Our approach enhances the SLAug baseline by reducing misclassification (red dashed box) and refining organ boundaries.
  • Figure 4: Qualitative results (Cross-Sequence Cardiac): Our approach outperforms the SLAug baseline in delineating organ boundaries (highlighted by red dashed boxes).
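
The pipeline in Figure 1 pairs label-wise text embeddings with image features. One plausible way to obtain a matching label-wise visual feature is masked average pooling over the feature map, sketched below in NumPy. This is an illustrative assumption: the captions do not state how visual features are pooled per label, the `(H, W, D)` layout and the function names are hypothetical, and the random `text_embeds` array merely stands in for the outputs of the frozen CLIP text encoder.

```python
import numpy as np


def masked_average_pool(features, label_map, num_classes):
    """Average the feature vectors of all pixels that carry each label.

    features:  (H, W, D) feature map from the segmentation network (assumed layout)
    label_map: (H, W) integer labels in [0, num_classes)
    returns:   (num_classes, D); rows for labels absent from label_map stay zero
    """
    dim = features.shape[-1]
    pooled = np.zeros((num_classes, dim), dtype=features.dtype)
    for c in range(num_classes):
        mask = label_map == c
        if mask.any():
            pooled[c] = features[mask].mean(axis=0)  # boolean mask selects (N, D) pixels
    return pooled


def cosine_similarity_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T


# Toy data: random vectors stand in for frozen text-encoder outputs.
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
text_embeds = rng.normal(size=(num_classes, dim))
label_map = rng.integers(0, num_classes, size=(16, 16))
# Synthetic image features: each pixel is its label's embedding plus small noise.
features = text_embeds[label_map] + 0.1 * rng.normal(size=(16, 16, dim))

pooled = masked_average_pool(features, label_map, num_classes)
sim = cosine_similarity_matrix(pooled, text_embeds)  # (num_classes, num_classes)
```

In this toy setup each pooled feature is most similar to the text embedding of its own label, so the largest entry of each row of `sim` lies on the diagonal. A contrastive alignment loss would push real features toward that pattern.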