Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization

Mainak Singha, Ankit Jha, Shirsha Bose, Ashwin Nair, Moloud Abdar, Biplab Banerjee

TL;DR

Open Domain Generalization (ODG) requires classifiers to handle both domain shift and unseen categories without target-domain supervision. The authors propose ODG-CLIP, a CLIP-based framework that treats ODG as a (C+1)-class problem: C known classes plus one unknown class. A dedicated unknown-class prompt is trained on pseudo-open images generated by a pre-trained Stable Diffusion model, and domain-aware prompt learning tailors the classification weights to each domain. The framework also builds a latent visual space guided by prompt tokens to improve CLIP's visual embeddings, with a semantic-consistency loss that keeps this injected class information consistent across domains. ODG-CLIP reports state-of-the-art results on six DG benchmarks in both closed-set and open-set settings, improving on CNN-based and CLIP-based baselines.

Abstract

We delve into Open Domain Generalization (ODG), marked by domain and category shifts between training's labeled source and testing's unlabeled target domains. Existing solutions to ODG face limitations due to constrained generalizations of traditional CNN backbones and errors in detecting target open samples in the absence of prior knowledge. Addressing these pitfalls, we introduce ODG-CLIP, harnessing the semantic prowess of the vision-language model, CLIP. Our framework brings forth three primary innovations: Firstly, distinct from prevailing paradigms, we conceptualize ODG as a multi-class classification challenge encompassing both known and novel categories. Central to our approach is modeling a unique prompt tailored for detecting unknown class samples, and to train this, we employ a readily accessible stable diffusion model, elegantly generating proxy images for the open class. Secondly, aiming for domain-tailored classification (prompt) weights while ensuring a balance of precision and simplicity, we devise a novel visual style-centric prompt learning mechanism. Finally, we infuse images with class-discriminative knowledge derived from the prompt space to augment the fidelity of CLIP's visual embeddings. We introduce a novel objective to safeguard the continuity of this infused semantic intel across domains, especially for the shared classes. Through rigorous testing on diverse datasets, covering closed and open-set DG contexts, ODG-CLIP demonstrates clear supremacy, consistently outpacing peers with performance boosts between 8%-16%. Code will be available at https://github.com/mainaksingha01/ODG-CLIP.
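
The multi-class formulation described in the abstract can be illustrated with a minimal sketch. It assumes CLIP-style scoring (softmax over temperature-scaled cosine similarities) and treats the unknown class as one extra prompt embedding appended to the C known ones. The function names, the temperature value, and the toy 3-dimensional embeddings are illustrative assumptions, not taken from the paper or its code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_c_plus_1(image_emb, known_prompt_embs, unknown_prompt_emb,
                      temperature=0.01):
    """Score an image against C known-class prompts plus one unknown prompt.

    Returns (predicted_index, probabilities). Index C (the last one)
    stands for the unknown class. The temperature is an assumed value.
    """
    prompts = list(known_prompt_embs) + [unknown_prompt_emb]
    logits = [cosine(image_emb, p) / temperature for p in prompts]
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs

# Toy embeddings (hypothetical): two known classes and one unknown prompt.
known = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
unknown = [0.0, 0.0, 1.0]

pred_known, probs_known = classify_c_plus_1([0.9, 0.1, 0.0], known, unknown)
pred_open, probs_open = classify_c_plus_1([0.1, 0.0, 0.95], known, unknown)
```

In this toy setup the first image lands on known class 0 and the second on the unknown index 2. Because the unknown class is an ordinary output of the classifier, no separate rejection threshold is needed in this sketch.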

Paper Structure

This paper contains 21 sections, 3 equations, 8 figures, and 22 tables.

Figures (8)

  • Figure 1: ODG-CLIP operates as a multi-class classifier leveraging prompt learning for effective management of known categories and outliers in an ODG context. Central to its methodology is a novel unknown-class prompt, designed for open-set samples and integrated with CLIP's unaltered image and text encoders, $\mathcal{F}_v$ and $\mathcal{F}_t$. For the training of unknown-class prompt weights, ODG-CLIP employs pseudo-unknown image generation via Stable Diffusion (SD). Diverging from existing methods such as CLIP, CoOp, and StyLIP, ODG-CLIP focuses on creating a refined latent visual space to improve visual embeddings and address domain disparities efficiently.
  • Figure 2: Model architecture of ODG-CLIP, which consists of three main components for designing a multi-class classifier over closed and open classes using prompt learning, with a novel unknown-class prompt for the outliers. First, pseudo-open samples are generated with a pre-trained diffusion model using specialized positive and negative textual instructions. The combined images $\mathcal{D} \cup \mathcal{D}_{open}$ then go through the prompt learning stage with specialized projectors ($\mathcal{F}_{dom}$), where two types of prompts are learned per image: one using domain and class information, and the other using only domain information. Their difference is used to obtain the latent visual representation $\tilde{x}$ for a given image $x$ conditioned on the class labels from $\mathcal{Y}_{aug}$, through $\mathcal{F}_{up}$ and $\mathcal{F}_v^{proj}$. The model is trained using $\mathcal{L}_{con} + \mathcal{L}_{sem}$ given all the source domains in $\mathcal{D} \cup \mathcal{D}_{open}$. During inference, the latent representations for a target image are created with respect to all the class labels, and the class maximizing Eq. \ref{eq:prob} is selected.
  • Figure 3: Top: Ablation on the average cosine similarity values of $\hat{x}$ on four shared classes across the domains in PACS. Bottom: Openness analysis of different methods on Office-Home.
  • Figure 4: Comparison of ODG-CLIP with CuMix and OpenGAN on inception score for the generated open-set samples.
  • Figure 5: GFLOPs comparison of different methods.
  • ...and 3 more figures
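
The inference step described in the Figure 2 caption (build one class-conditioned latent representation per label, then pick the best-scoring class) can be sketched as below. This is only an outline under stated assumptions: `make_latent`, `embed_latent`, and `prompt_embedding` are hypothetical stand-ins for the roles the caption assigns to $\mathcal{F}_{up}$, $\mathcal{F}_v^{proj}$, and the learned prompts, and a plain dot product replaces the paper's probability equation, which is not reproduced on this page.

```python
def predict(image, class_labels, make_latent, embed_latent,
            prompt_embedding, similarity):
    """Pick the label whose class-conditioned latent best matches its prompt.

    All four callables are hypothetical placeholders; the real model would
    supply learned networks here.
    """
    scores = {}
    for label in class_labels:
        latent = make_latent(image, label)       # latent conditioned on this label
        visual = embed_latent(latent)            # visual embedding of the latent
        scores[label] = similarity(visual, prompt_embedding(label))
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-ins (assumptions for illustration only).
CLASS_DIRS = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "unknown": [-1.0, -1.0]}

def toy_make_latent(image, label):
    # Nudge the image vector toward the candidate class direction.
    return [x + 0.1 * d for x, d in zip(image, CLASS_DIRS[label])]

def toy_embed_latent(latent):
    return latent  # identity stand-in for a visual encoder

def toy_prompt_embedding(label):
    return CLASS_DIRS[label]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

labels = ["dog", "cat", "unknown"]  # known classes plus the unknown class
pred, scores = predict([0.2, 0.9], labels, toy_make_latent,
                       toy_embed_latent, toy_prompt_embedding, dot)
```

With these toy functions the image vector `[0.2, 0.9]` is assigned to `"cat"`. The point of the sketch is the control flow: the loop runs once per label in the augmented label set, so inference cost grows linearly with the number of classes.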