SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation

Hritam Basak, Zhaozheng Yin

TL;DR

SemiDAViL tackles semantic segmentation under semi-supervised domain adaptation by combining vision-language priors, dense language guidance, and a dynamic class-balanced loss. The approach couples a vision-language pre-trained backbone with a novel Dense Language Guidance (DLG) module, a consistency training (CT) scheme, and the DyCE loss to address misclassification, domain bias, and tail-class imbalance. Empirical results on GTA5 and SYNTHIA to Cityscapes, as well as on an imbalanced medical dataset, show consistent gains over state-of-the-art methods, with pronounced improvements on rare classes and in low-label regimes. The work indicates that language-informed cross-modal representations and adaptive gradient weighting can make dense segmentation more robust under challenging domain shifts, which matters for real-world, data-scarce deployment. Together, DLG, CT, and DyCE form a toolkit for future SSDA research and cross-domain semantic understanding.

Abstract

Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation for two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confusing classes with similar visual appearance due to limited supervision; and (2) a skewed and imbalanced training data distribution favors source representation learning while impeding the exploration of the limited information available about tail classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code is available at https://github.com/hritam-98/SemiDAViL.

Paper Structure

This paper contains 24 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Major contributions of SemiDAViL: (1) We propose the first language-guided SSDA framework for semantic segmentation, (2) Utilizing spatial context via dense language guidance (DLG) improves segmentation performance, (3) Our proposed DyCE loss dynamically reweights imbalanced class distributions, resulting in precise segmentation of minority classes.
  • Figure 2: Overview of SemiDAViL: We leverage Vision-Language (VL) Pre-training (top) to initialize the language encoder $\mathcal{E}_\mathcal{L}$ and vision encoders $\mathcal{E}_\mathcal{V}^{\{\mathcal{S, T}\}}$ in a semi-supervised setting (bottom), where $\mathcal{S}$ and $\mathcal{T}$ denote the student and teacher branches, respectively. To bridge image-level VL features for dense pixel-level tasks, we utilize a captioning model $\mathcal{C}$ to generate text descriptions of images and a Dense Language Guidance (DLG) module. The framework is trained with a supervised loss $\mathcal{L}_{DyCE}$ for labeled data and a consistency loss $\mathcal{L}_{\mathcal{CT}}$ for unlabeled data.
  • Figure 3: Overall architecture of our proposed DLG module: it is based on dense similarity maps of the vision and text embeddings. More details are provided in the Dense Language Guidance subsection.
  • Figure 4: Qualitative segmentation performance of SemiDAViL with and without the DyCE loss, using 100 labeled target samples.
  • Figure 5: Comparative analysis of multiple vision-language attention mechanisms: our Dense Language Guidance (DLG), VLGA [hoyer2025semivl], WPCT [li2022language], and Cross-attention [chen2021crossvit] on (a) GTA5$\to$Cityscapes and (b) SYNTHIA$\to$Cityscapes with different amounts of labeled target annotations.
  • ...and 3 more figures
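
The Figure 3 caption says the DLG module is built on dense similarity maps between vision and text embeddings. As a rough illustration of that one idea, the sketch below computes a cosine-similarity map between every pixel embedding and every text embedding. It is a generic example and not the authors' DLG module: the function name `dense_similarity_map`, the tensor shapes, and the random inputs are all assumptions made for illustration.

```python
import numpy as np

def dense_similarity_map(pixel_feats, text_feats, eps=1e-8):
    """Cosine similarity between each pixel embedding and each text embedding.

    pixel_feats: (H, W, D) per-pixel visual embeddings (hypothetical shape).
    text_feats:  (K, D) text embeddings, e.g. one per caption token or class name.
    Returns an (H, W, K) array with values in [-1, 1].
    """
    # L2-normalise both sets of embeddings so the dot product is a cosine.
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + eps)
    t = text_feats / (np.linalg.norm(text_feats, axis=-1, keepdims=True) + eps)
    # Contract over the embedding dimension D for every (pixel, text) pair.
    return np.einsum("hwd,kd->hwk", p, t)

# Toy inputs: a 4x6 feature map with 16-dim embeddings and 3 text embeddings.
rng = np.random.default_rng(0)
pixel_feats = rng.normal(size=(4, 6, 16))
text_feats = rng.normal(size=(3, 16))
# Make one pixel identical to the first text embedding as a sanity check.
pixel_feats[0, 0] = text_feats[0]

sim = dense_similarity_map(pixel_feats, text_feats)
print(sim.shape)
```

Each channel of `sim` is a spatial map showing where the image agrees with one text embedding. How SemiDAViL turns such maps into guidance for the segmentation features is described in the paper itself and is not reproduced here.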