Closing the Confusion Loop: CLIP-Guided Alignment for Source-Free Domain Adaptation
Shanshan Wang, Ziying Feng, Xiaozheng Shen, Xun Yang, Pichao Wang, Zhenwei He, Xingyi Zhang
TL;DR
This paper addresses Source-Free Domain Adaptation (SFDA) by proposing CGA, a three-stage framework that explicitly models and mitigates asymmetric inter-class confusion using CLIP-guided prompts. CGA detects directional confusion (MCA), encodes ambiguity with multi-prototype textual prompts (MCC), and aligns confusion-aware feature banks through contrastive learning (FAM), supervised by a composite loss that integrates CLIP and source-model signals. The approach achieves state-of-the-art results on challenging SFDA benchmarks (VisDA, DomainNet-126, Office-Home, Office-31), with pronounced gains in fine-grained and confusion-prone settings, and provides extensive ablations and visual analyses to validate its components. This work demonstrates the practical impact of explicitly modeling class confusion for robust source-free adaptation and offers a scalable framework for integrating CLIP semantics into target-space learning while preserving data privacy.
Abstract
Source-Free Domain Adaptation (SFDA) tackles the problem of adapting a pre-trained source model to an unlabeled target domain without access to any source data, which makes it well suited to settings where data security matters. Although recent work has shown that pseudo-labeling strategies can be effective, they often fail in fine-grained scenarios because of subtle inter-class similarities. A critical but underexplored issue is asymmetric and dynamic class confusion, in which visually similar classes are unequally and inconsistently misclassified by the source model. Existing methods typically ignore such confusion patterns, leading to noisy pseudo-labels and poor target discrimination. To address this, we propose CLIP-Guided Alignment (CGA), a novel framework that explicitly models and mitigates class confusion in SFDA. Our method consists of three parts: (1) MCA first detects directional confusion pairs by analyzing the source model's predictions on the target domain; (2) MCC leverages CLIP to construct confusion-aware textual prompts (e.g., "a truck that looks like a bus"), enabling more context-sensitive pseudo-labeling; and (3) FAM builds confusion-guided feature banks for both CLIP and the source model and aligns them using contrastive learning to reduce ambiguity in the representation space. Extensive experiments on various datasets demonstrate that CGA consistently outperforms state-of-the-art SFDA methods, with especially notable gains in confusion-prone and fine-grained scenarios. Our results highlight the importance of explicitly modeling inter-class confusion for effective source-free adaptation. Our code can be found at https://github.com/soloiro/CGA
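
To make the first two parts concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a directional confusion pair can be approximated by counting how often class A is the source model's top-1 prediction while class B is the runner-up, and that the predicted class fills the first slot of the prompt template. The function names, the `min_count` threshold, and the toy probabilities are all hypothetical.

```python
from collections import Counter


def directional_confusion_pairs(probs, min_count=2):
    """Count ordered (top-1, top-2) class pairs over target predictions.

    probs: list of per-sample class probability lists from the source model.
    Assumption: a frequent (top-1, top-2) pair signals directional confusion.
    Returns (predicted, runner_up, count) tuples, most frequent first.
    """
    counts = Counter()
    for row in probs:
        # Rank class indices by probability, highest first.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        counts[(ranked[0], ranked[1])] += 1
    return [(a, b, n) for (a, b), n in counts.most_common() if n >= min_count]


def confusion_prompt(class_names, predicted, runner_up):
    """Fill the template quoted in the abstract (slot order is an assumption)."""
    return f"a {class_names[predicted]} that looks like a {class_names[runner_up]}"


# Toy target-domain predictions (hypothetical values).
class_names = ["bus", "car", "truck"]
probs = [
    [0.40, 0.05, 0.55],  # top-1 truck, runner-up bus
    [0.35, 0.10, 0.55],  # top-1 truck, runner-up bus
    [0.30, 0.10, 0.60],  # top-1 truck, runner-up bus
    [0.70, 0.10, 0.20],  # top-1 bus, runner-up truck
    [0.15, 0.80, 0.05],  # top-1 car, runner-up bus
]

pairs = directional_confusion_pairs(probs)
prompts = [confusion_prompt(class_names, a, b) for a, b, _ in pairs]
print(pairs)    # [(2, 0, 3)]
print(prompts)  # ['a truck that looks like a bus']
```

In this toy data the pair (truck, bus) occurs three times while (bus, truck) occurs only once, so only one direction passes the threshold. This is the kind of asymmetry the abstract describes; in practice the resulting prompts would be passed to a CLIP text encoder, a step omitted here.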
