TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation

Lalit Maurya, Honghai Liu, Reyer Zwiggelaar

TL;DR

TCSA-UDA tackles cross-modality domain shifts in medical image segmentation with text-driven cross-semantic alignment, coupling vision and language through four components: a vision-language covariance loss (VLCoL), a transformer-based modality-aware fusion, adversarial structural alignment, and semantic prototype alignment. VLCoL aligns class-wise pixel features with text embeddings via covariance similarity, while EMA-based prototype alignment reduces residual cross-domain discrepancies; the whole model is trained with source supervision and target-domain adaptation. Across cardiac, abdominal, and brain-tumor segmentation benchmarks, the approach consistently surpasses state-of-the-art UDA methods, achieving higher Dice scores and lower ASD in both the CT-to-MRI and MRI-to-CT directions, and Grad-CAM analyses indicate better semantic and structural consistency. The framework suggests a practical path toward robust, language-guided domain adaptation in clinical imaging, with potential for further gains from long-form domain descriptors generated by large language models.
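To make the EMA-based prototype idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a prototype is the mean feature vector of a class, that target-domain labels would in practice be pseudo-labels taken from the model's predictions, and that the cross-domain gap is measured by a mean squared distance. The function names, the `momentum` value, and the distance measure are all hypothetical choices for illustration.

```python
def class_means(features, labels, num_classes):
    """Average the feature vectors of each class; None if a class is absent.

    Assumes `features` is a non-empty list of equal-length float lists.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, cls in zip(features, labels):
        counts[cls] += 1
        for i, v in enumerate(vec):
            sums[cls][i] += v
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def ema_update(prototypes, batch_means, momentum=0.9):
    """Blend running prototypes with the current batch means (EMA).

    `momentum` is an illustrative value, not one taken from the paper.
    """
    updated = []
    for proto, mean in zip(prototypes, batch_means):
        if mean is None:        # class missing from this batch: keep old value
            updated.append(proto)
        elif proto is None:     # first observation initialises the prototype
            updated.append(list(mean))
        else:
            updated.append([momentum * p + (1.0 - momentum) * m
                            for p, m in zip(proto, mean)])
    return updated


def prototype_gap(src_protos, tgt_protos):
    """Mean squared distance between matching source/target prototypes.

    The squared-distance form is an assumption made for this sketch.
    """
    dists = [
        sum((s - t) ** 2 for s, t in zip(sp, tp))
        for sp, tp in zip(src_protos, tgt_protos)
        if sp is not None and tp is not None
    ]
    return sum(dists) / len(dists) if dists else 0.0
```

In a training loop under these assumptions, `class_means` would be computed per batch in each domain, `ema_update` would maintain one running prototype set per domain, and `prototype_gap` would serve as the quantity to minimize.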

Abstract

Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.
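One plausible reading of a covariance cosine loss is sketched below; the exact formulation is given by the paper's equations, which are not reproduced in this summary. The sketch assumes one pooled visual feature vector and one text embedding per class, builds a class-by-class covariance matrix for each modality, and penalizes the cosine dissimilarity between the two matrices. The names `relation_matrix` and `covariance_cosine_loss` are hypothetical.

```python
import math


def relation_matrix(class_vectors):
    """Inter-class covariance matrix (C x C) of per-class vectors.

    Each class vector is centred over its own feature dimensions, then
    pairwise inner products are taken and divided by (dim - 1).
    """
    dim = len(class_vectors[0])
    centred = []
    for vec in class_vectors:
        mean = sum(vec) / len(vec)
        centred.append([v - mean for v in vec])
    return [
        [sum(a * b for a, b in zip(u, w)) / (dim - 1) for w in centred]
        for u in centred
    ]


def covariance_cosine_loss(visual_class_feats, text_class_embeds):
    """1 - cosine similarity between the flattened relation matrices.

    Because both matrices are C x C, the visual and text feature
    dimensions may differ. This form is an assumption for illustration.
    """
    vis = [x for row in relation_matrix(visual_class_feats) for x in row]
    txt = [x for row in relation_matrix(text_class_embeds) for x in row]
    dot = sum(a * b for a, b in zip(vis, txt))
    norm = math.sqrt(sum(a * a for a in vis)) * math.sqrt(sum(b * b for b in txt))
    return 1.0 - dot / (norm + 1e-12)  # small epsilon avoids division by zero
```

Under this reading the loss is near zero when the visual features reproduce the inter-class structure of the text embeddings, regardless of overall feature scale, which is the sense in which the textual class relations act as a modality-invariant target.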

Paper Structure

This paper contains 21 sections, 14 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Class-specific visual features (squares and circles) from source and target domains are aligned with domain-agnostic text embeddings (stars) via covariance matching. Red bidirectional arrows represent cross-modal covariance learning, encouraging semantic consistency across domains in a shared feature space.
  • Figure 2: Schematic representation of the proposed TCSA-UDA framework, comprising: (a) text-driven semantic covariance learning via $\mathcal{L}_{\text{VLCoL}}$, (b) text-driven supervised learning on the source domain via $\mathcal{L}_{\text{seg}}$, (c) adversarial learning using both text-driven and auxiliary predictions via $\mathcal{L}_{\text{adv}}$, and (d) class-wise cross-domain semantic alignment via $\mathcal{L}_{\text{proto}}$.
  • Figure 3: Qualitative segmentation results of the compared algorithms. Blue, green, yellow, and red denote the AA, LAC, LVC, and MYO, respectively.
  • Figure 4: Qualitative abdominal organ segmentation results of the compared algorithms. Blue, green, yellow, and red denote the Spleen, RK, LK, and Liver, respectively.
  • Figure 5: Qualitative brain tumor segmentation results of the compared algorithms.
  • ...and 1 more figure