
Closing the Confusion Loop: CLIP-Guided Alignment for Source-Free Domain Adaptation

Shanshan Wang, Ziying Feng, Xiaozheng Shen, Xun Yang, Pichao Wang, Zhenwei He, Xingyi Zhang

TL;DR

This paper addresses Source-Free Domain Adaptation (SFDA) by proposing CGA, a three-stage framework that explicitly models and mitigates asymmetric inter-class confusion using CLIP-guided prompts. CGA detects directional confusion (MCA), encodes ambiguity with multi-prototype textual prompts (MCC), and aligns confusion-aware feature banks through contrastive learning (FAM), supervised by a composite loss that integrates CLIP and source-model signals. The approach achieves state-of-the-art results on challenging SFDA benchmarks (VisDA, DomainNet-126, Office-Home, Office-31), with pronounced gains in fine-grained and confusion-prone settings, and provides extensive ablations and visual analyses to validate its components. This work demonstrates the practical impact of explicitly modeling class confusion for robust source-free adaptation and offers a scalable framework for integrating CLIP semantics into target-space learning while preserving data privacy.
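To make "directional confusion" concrete, here is a minimal sketch of one way to estimate it from source-model predictions on unlabeled target data. The heuristic is an assumption for illustration, not the paper's MCA procedure: for samples whose top-1 prediction is class i, it averages the probability mass placed on every other class j. The function names `directional_confusion` and `top_confusion_pairs` and the toy probabilities are hypothetical.

```python
from typing import List, Tuple


def directional_confusion(probs: List[List[float]]) -> List[List[float]]:
    """Assumed heuristic: entry [i][j] is the mean probability given to class j
    over the samples whose top-1 prediction is class i (diagonal left at 0)."""
    num_classes = len(probs[0])
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p in probs:
        top = max(range(num_classes), key=lambda k: p[k])
        counts[top] += 1
        for j in range(num_classes):
            if j != top:
                sums[top][j] += p[j]
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(num_classes)
    ]


def top_confusion_pairs(conf: List[List[float]], k: int = 1) -> List[Tuple[int, int]]:
    """For each class i, return up to k directed pairs (i, j) with the largest conf[i][j]."""
    pairs = []
    for i, row in enumerate(conf):
        others = [j for j in range(len(row)) if j != i]
        ranked = sorted(others, key=lambda j: row[j], reverse=True)
        pairs.extend((i, j) for j in ranked[:k] if row[j] > 0.0)
    return pairs


# Toy softmax outputs over three hypothetical classes: 0 = truck, 1 = bus, 2 = car.
probs = [
    [0.60, 0.35, 0.05],
    [0.55, 0.40, 0.05],
    [0.15, 0.80, 0.05],
    [0.05, 0.15, 0.80],
]
conf = directional_confusion(probs)
pairs = top_confusion_pairs(conf, k=1)
```

In this toy data the matrix is asymmetric: samples predicted as truck lean toward bus more strongly than samples predicted as bus lean toward truck, which is the kind of directional pattern the TL;DR refers to.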

Abstract

Source-Free Domain Adaptation (SFDA) tackles the problem of adapting a pre-trained source model to an unlabeled target domain without accessing any source data, which makes it well suited to settings with data-security constraints. Although recent advances have shown that pseudo-labeling strategies can be effective, they often fail in fine-grained scenarios due to subtle inter-class similarities. A critical but underexplored issue is the presence of asymmetric and dynamic class confusion, where visually similar classes are unequally and inconsistently misclassified by the source model. Existing methods typically ignore such confusion patterns, leading to noisy pseudo-labels and poor target discrimination. To address this, we propose CLIP-Guided Alignment (CGA), a novel framework that explicitly models and mitigates class confusion in SFDA. Our method consists of three parts: (1) MCA first detects directional confusion pairs by analyzing the predictions of the source model in the target domain; (2) MCC leverages CLIP to construct confusion-aware textual prompts (e.g., "a truck that looks like a bus"), enabling more context-sensitive pseudo-labeling; and (3) FAM builds confusion-guided feature banks for both CLIP and the source model and aligns them using contrastive learning to reduce ambiguity in the representation space. Extensive experiments on various datasets demonstrate that CGA consistently outperforms state-of-the-art SFDA methods, with especially notable gains in confusion-prone and fine-grained scenarios. Our results highlight the importance of explicitly modeling inter-class confusion for effective source-free adaptation. Our code can be found at https://github.com/soloiro/CGA
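
As an illustration of the confusion-aware prompts described in part (2), the sketch below builds several text prompts per class from a list of directed confusion pairs. The "looks like" template follows the example in the abstract. The base template "a photo of a ...", the function name `confusion_prompts`, and the sample pairs are assumptions made for this sketch, not details taken from the paper.

```python
from typing import Dict, List, Tuple


def confusion_prompts(
    class_names: List[str], pairs: List[Tuple[int, int]]
) -> Dict[str, List[str]]:
    """Return a list of prompts per class: one generic prompt plus one
    confusion-aware prompt for each directed pair (i, j) starting at that class."""
    # Assumed base template; the paper may use a different one.
    prompts = {name: [f"a photo of a {name}"] for name in class_names}
    for i, j in pairs:
        src, dst = class_names[i], class_names[j]
        # Template modeled on the abstract's example "a truck that looks like a bus".
        prompts[src].append(f"a {src} that looks like a {dst}")
    return prompts


# Hypothetical classes and directed confusion pairs (truck -> bus, bus -> truck).
class_names = ["truck", "bus", "car"]
pairs = [(0, 1), (1, 0)]
prompts = confusion_prompts(class_names, pairs)
```

Each class ends up with one or more prompts, which could then be encoded by a text encoder such as CLIP's to give several text prototypes per class. Classes with no detected confusion keep only the generic prompt.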

Paper Structure

This paper contains 20 sections, 17 equations, 8 figures, 6 tables, and 2 algorithms.

Figures (8)

  • Figure 1: During conventional mutual learning, cross-domain misconceptions from the source model contaminate the generalized knowledge in CLIP. Our method, which employs confusing-class prompts to induce biased knowledge in CLIP, prevents this contamination.
  • Figure 2: (a) The modules highlighted in red are the three core components of our model: MCA (Model Class Confusion Analysis Module), MCC (Multi-Prototype Confused CLIP), and FAM (Feature Space Alignment Module). (b) The detailed workflow of CCF (Constructing a Confusion-Aware Feature Center Bank). (c) The structure and data flow of FAM (Feature Space Alignment Module).
  • Figure 3: Grad-CAM visualizations of the Source, COWA, CGA w/o MCC, and CGA models trained on VisDA.
  • Figure 4: t-SNE visualization comparing feature distributions on the Ar→Cl transfer task in Office-Home.
  • Figure 5: (a) Time comparison between the introduced pre-operations and the training process. (b) Time consumption of the Feature Alignment Module during training.
  • ...and 3 more figures