COSMo: CLIP Talks on Open-Set Multi-Target Domain Adaptation

Munish Monga; Sachin Kumar Giroh; Ankit Jha; Mainak Singha; Biplab Banerjee; Jocelyn Chanussot

COSMo: CLIP Talks on Open-Set Multi-Target Domain Adaptation

Munish Monga, Sachin Kumar Giroh, Ankit Jha, Mainak Singha, Biplab Banerjee, Jocelyn Chanussot

TL;DR

This work tackles Open-set Multi-Target Domain Adaptation (OSMTDA), where a model trained on a single labeled source domain must recognize known classes and identify unknowns across multiple unlabeled target domains. COSMo addresses this by embedding domain-specific bias into prompts and maintaining separate, domain-agnostic prompts for known and unknown classes within CLIP, aided by a Domain-Specific Bias Network (DSBN). The method demonstrates strong improvements in the harmonic mean of known and unknown accuracy (HOS) across Office-31, Office-Home, and Mini-DomainNet, reflecting robust adaptation to both domain shifts and open-set class expansion. The approach offers a practical pathway for real-world deployment of cross-domain recognition systems with limited labeling, and points to future work in extending to semantic segmentation and object detection tasks.

Abstract

Multi-Target Domain Adaptation (MTDA) entails learning domain-invariant information from a single source domain and applying it to multiple unlabeled target domains. Yet, existing MTDA methods predominantly focus on addressing domain shifts within visual features, often overlooking semantic features and struggling to handle unknown classes, resulting in what is known as Open-Set (OS) MTDA. While large-scale vision-language foundation models like CLIP show promise, their potential for MTDA remains largely unexplored. This paper introduces COSMo, a novel method that learns domain-agnostic prompts through source domain-guided prompt learning to tackle the MTDA problem in the prompt space. By leveraging a domain-specific bias network and separate prompts for known and unknown classes, COSMo effectively adapts across domain and class shifts. To the best of our knowledge, COSMo is the first method to address Open-Set Multi-Target DA (OSMTDA), offering a more realistic representation of real-world scenarios and addressing the challenges of both open-set and multi-target DA. COSMo demonstrates an average improvement of $5.1\%$ across three challenging datasets: Mini-DomainNet, Office-31, and Office-Home, compared to other related DA methods adapted to operate within the OSMTDA setting. Code is available at: https://github.com/munish30monga/COSMo

COSMo: CLIP Talks on Open-Set Multi-Target Domain Adaptation

TL;DR

Abstract

across three challenging datasets: Mini-DomainNet, Office-31, and Office-Home, compared to other related DA methods adapted to operate within the OSMTDA setting. Code is available at: https://github.com/munish30monga/COSMo

Paper Structure (18 sections, 3 equations, 6 figures, 8 tables, 1 algorithm)

This paper contains 18 sections, 3 equations, 6 figures, 8 tables, 1 algorithm.

Introduction
Related Work
Open-set Domain Adaptation
Multi-target domain adaptation
Vision-language models and prompt learning
Methodology
Problem formulation
Overview of COSMo
Model optimization
Experiments
Comparison to the literature
Ablation studies
Conclusion
Acknowledgements
Contents of the supplementary materials
...and 3 more sections

Figures (6)

Figure 1: OSMTDA differs from traditional DA settings like Open-set DA by handling unknown classes across diverse target domains, while Multi-target DA transfers knowledge from a single labeled source to multiple unlabeled targets. unk denotes the unknown class.
Figure 2: The architecture overview of COSMo, where $\mathcal{F}_v$ and $\mathcal{F}_t$ are the frozen pretrained CLIP's image and text encoders, respectively. $P_{kwn}$ and $P_{unk}$ denote the prompts for the known and unknown classes, respectively. $\mathcal{B}_\theta(\cdot)$ represents the domain specific bias network, which generates the domain-bias context tokens $\beta$. Best view in color.
Figure 3: t-SNE visualizations on the Office31 Dataset with Amazon as the source domain. Colored dots represent known classes in the source domain, while black triangles denote target domain samples. For COSMo, text embeddings are used, while features from the penultimate layer are used for the other models.
Figure 4: Effect of varying the entropy regularization parameter $\lambda$ on the Office-31 dataset.
Figure 5: Impact of having separate known prompts $P_{kwn}$ and unknown prompts $P_{unk}$. Here 'N' represents no separate prompt, and 'Y' represents that the separate prompts are used.
...and 1 more figures

COSMo: CLIP Talks on Open-Set Multi-Target Domain Adaptation

TL;DR

Abstract

COSMo: CLIP Talks on Open-Set Multi-Target Domain Adaptation

Authors

TL;DR

Abstract

Table of Contents

Figures (6)