OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP

Mohamad Hassan N C, Divyam Gupta, Mainak Singha, Sai Bhargav Rongali, Ankit Jha, Muhammad Haris Khan, Biplab Banerjee

TL;DR

OSLoPrompt tackles low-shot open-set domain generalization (LSOSDG) by uniting low-shot domain generalization with open-set handling in CLIP. It introduces domain-agnostic prompt learning augmented by image-to-attribute cross-attention and learnable visual prompts, plus a controlled synthesis of fine-grained pseudo-open samples via GPT-4o and Stable Diffusion to train an Unknown class. The approach is validated on five benchmarks, where it achieves state-of-the-art H-scores and improves significantly over strong baselines. Ablations over domain-specific versus domain-agnostic prompts and over the loss terms show robust gains. The work provides a practical framework for robust open-world recognition under scarce supervision, with potential extensions to broader open-world and structured prediction tasks.

Abstract

We introduce Low-Shot Open-Set Domain Generalization (LSOSDG), a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). While prompt-based methods using models like CLIP have advanced DG, they falter in low-data regimes (e.g., 1-shot) and lack precision in detecting open-set samples with fine-grained semantics related to training classes. To address these challenges, we propose OSLoPrompt, an advanced prompt-learning framework for CLIP with two core innovations. First, to manage limited supervision across source domains and improve DG, we introduce a domain-agnostic prompt-learning mechanism that integrates adaptable domain-specific cues and visually guided semantic attributes through a novel cross-attention module; it is further supported by learnable domain- and class-generic visual prompts that enhance cross-modal adaptability. Second, to improve outlier rejection during inference, we classify unfamiliar samples as "unknown" and train specialized prompts with systematically synthesized pseudo-open samples that maintain fine-grained relationships to known classes, generated through a targeted query strategy with off-the-shelf foundation models. This strategy enhances feature learning, enabling our model to detect open samples with varied granularity more effectively. Extensive evaluations across five benchmarks demonstrate that OSLoPrompt establishes a new state-of-the-art in LSOSDG, significantly outperforming existing methods.
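
To make the image-to-attribute cross-attention idea concrete, here is a minimal pure-Python sketch of single-head scaled dot-product cross-attention. It is not the authors' implementation. The projection names `w_q`, `w_k`, `w_v` follow the Figure 4 caption, but the attention direction (image feature as query, attribute embeddings as keys and values), the helper names, and the toy inputs are all assumptions made for illustration.

```python
import math


def matvec(vec, mat):
    """Row vector (d,) times matrix (d, k), returning a list of length k."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_to_attribute_attention(image_feat, attr_embeds, w_q, w_k, w_v):
    """Attend from one image feature to a set of attribute embeddings.

    Assumption: the image is the query; attributes supply keys and values.
    Returns (context vector, attention weights over attributes).
    """
    q = matvec(image_feat, w_q)
    keys = [matvec(a, w_k) for a in attr_embeds]
    values = [matvec(a, w_v) for a in attr_embeds]
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    context = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return context, weights


# Toy example (hypothetical values): 2-d features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
image_feat = [1.0, 0.0]
attr_embeds = [[1.0, 0.0], [0.0, 1.0]]  # two attribute embeddings
context, weights = image_to_attribute_attention(
    image_feat, attr_embeds, identity, identity, identity
)
```

In this toy case the first attribute is aligned with the image feature, so it receives the larger attention weight and dominates the context vector.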

Paper Structure

This paper contains 21 sections, 13 equations, 7 figures, 16 tables, 1 algorithm.

Figures (7)

  • Figure 1: Harmonic score (H-score) between known- and novel-class performance for various CLIP-based DG, ODG, and open-set recognition techniques versus our approach in the LSOSDG setting with one training example per known class, demonstrating the improved performance of OSLoPrompt.
  • Figure 2: t-SNE visualization of known-class and pseudo-open samples generated by ODG-CLIP (left) and our method (right). Our approach produces fine-grained pseudo-open samples, creating a sharper closed-open class boundary and enhancing feature learning. This results in an improvement over ODG-CLIP on Mini-DomainNet (see the combined ablation table), significantly boosting open-set detection.
  • Figure 3: Proposed prompt learning: We develop a novel strategy for learning domain-agnostic prompts with tokens $\{\nu_{1:q:\mathcal{M}}\}$, inheriting context from source-specific prompts enriched with image-to-attribute encodings. Some tokens also integrate knowledge from visual prompts spanning all training domains and classes. This design differs considerably from prior approaches in the DG literature.
  • Figure 4: Working principles of OSLoPrompt. (a) Fine-grained pseudo-open samples $\mathcal{D}^{\text{open}}$ are generated using Stable Diffusion with pseudo-open class names $\mathcal{C}^{\text{open}}$ from GPT-4o. (b) GPT-4o generates attributes for each class in $\mathcal{C}$. (c) OSLoPrompt learns domain-agnostic prompts using tokens $\{\nu_{1:\mathcal{M}}\}$. The first $q$ tokens follow CoOp, while tokens $q+1$ to $\mathcal{M}$ are initialized via learnable visual prompts and transformed through a projector $\text{Proj}_{vt}$. Domain-agnostic prompts are regularized by domain-specific prompts enhanced with visually-guided semantic attributes, encoded through a cross-attention module with parameters ($\mathbf{w_k}, \mathbf{w_v}, \mathbf{w_q}$). The model is trained with a context alignment loss $\mathcal{L}_{\text{align}}$ together with two visual-textual classification losses: $\mathcal{L}_{\text{ce}}^{\text{dom-spec}}$ for the domain-specific prompts, computed on known-class samples from $\mathcal{D}$, and $\mathcal{L}_{\text{ce}}^{\text{dom-gen}}$ for the domain-agnostic prompts, computed on both known and pseudo-open class samples from $\mathcal{D}^{\text{aug}}$.
  • Figure 5: (a) Comparison of the cosine similarity between image and prompt embeddings from $\mathcal{F}_v$ and $\mathcal{F}_t$ under CLIP image feature conditioning and our proposed semantic attribute-driven encoding on the domain-specific prompts on PACS, showing improved image-prompt alignment with our approach. (b) Sensitivity of ODG-CLIP and OSLoPrompt to the number of training samples per class on PACS. (c) Openness sensitivity of OSLoPrompt and ODG-CLIP in the 1-shot Office-Home case for different known and novel class ratios. (d) Comparison of trainable parameters between ODG-CLIP and our method.
  • ...and 2 more figures
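
The H-score reported in Figure 1 is described as a harmonic score between known- and novel-class performance. A minimal sketch of that computation follows, assuming both inputs are accuracies on the same scale (for example, fractions in [0, 1]). The function name `h_score` and the sample values are hypothetical and are not taken from the paper.

```python
def h_score(known_acc: float, novel_acc: float) -> float:
    """Harmonic mean of known-class and novel-class accuracy.

    Both inputs must be non-negative and on the same scale.
    Returns 0.0 when both accuracies are zero.
    """
    if known_acc < 0 or novel_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = known_acc + novel_acc
    if total == 0:
        return 0.0
    return 2.0 * known_acc * novel_acc / total


# Hypothetical values: a model that is strong on known classes but weak
# at rejecting novel ones gets a low H-score (about 0.18 here), well
# below its arithmetic mean of 0.5.
unbalanced = h_score(0.9, 0.1)
balanced = h_score(0.5, 0.5)
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot score well by excelling on known classes while failing on open-set samples, or the reverse.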