SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
Hritam Basak, Zhaozheng Yin
TL;DR
SemiDAViL tackles semantic segmentation under semi-supervised domain shift by combining vision-language priors with dense language guidance and a dynamic class-balanced loss. The approach couples a vision-language pre-trained backbone with a novel Dense Language Guidance (DLG) module, a consistency training (CT) scheme, and the DyCE loss to address misclassification, domain bias, and tail-class imbalance. Empirical results on GTA5 and SYNTHIA to Cityscapes adaptation, as well as on a class-imbalanced medical dataset, show consistent gains over state-of-the-art methods, with pronounced improvements on rare classes and in low-label regimes. The work demonstrates that language-informed cross-modal representations and adaptive gradient weighting can make dense segmentation more robust under challenging domain shifts, which matters for real-world, data-scarce deployment. Together, DLG, CT, and DyCE form a versatile toolkit for future SSDA research and cross-domain semantic understanding.
Abstract
Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation for two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confusing classes with similar visual appearance due to limited supervision; and (2) the skewed, imbalanced training data distribution favors source representation learning while impeding exploration of the limited information available about tail classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code is available at https://github.com/hritam-98/SemiDAViL.
