Table of Contents
Fetching ...

Latent Domain Prompt Learning for Vision-Language Models

Zhixing Li, Arsham Gholamzadeh Khoee, Yinan Yu

TL;DR

This work tackles domain generalization for vision–language models without relying on explicit domain labels. It introduces Latent Domain Prompt Fusion (LDPF), a framework that discovers latent domains through adversarial feature extraction and k-means clustering, and then fuses domain-specific soft prompts using a domain-similarity-based weighting to improve cross-domain alignment. The method employs a two-stage prompt learning regime with losses $L_{dsp}$, $L_{dap}$, and $L_{adv}$, and a total objective $\mathcal{L}=L_{dsp}+\lambda(L_{dap}-L_{adv})$, while keeping the encoders frozen. Across four benchmarks, LDPF yields consistent gains over domain-label-free baselines and demonstrates competitive performance relative to methods that use explicit domain labels, though an upper-bound analysis reveals limitations of simple fusion in highly specialized domains. The findings highlight the practical potential of unlabeled latent-domain adaptation for robust vision–language understanding and point to future work on more expressive fusion or gating mechanisms to fully exploit promoter complementarity.

Abstract

The objective of domain generalization (DG) is to enable models to be robust against domain shift. DG is crucial for deploying vision-language models (VLMs) in real-world applications, yet most existing methods rely on domain labels that may not be available and often ambiguous. We instead study the DG setting where models must generalize well without access to explicit domain labels. Our key idea is to represent an unseen target domain as a combination of latent domains automatically discovered from training data, enabling the model to adaptively transfer knowledge across domains. To realize this, we perform latent domain clustering on image features and fuse domain-specific text features based on the similarity between the input image and each latent domain. Experiments on four benchmarks show that this strategy yields consistent gains over VLM-based baselines and provides new insights into improving robustness under domain shift.

Latent Domain Prompt Learning for Vision-Language Models

TL;DR

This work tackles domain generalization for vision–language models without relying on explicit domain labels. It introduces Latent Domain Prompt Fusion (LDPF), a framework that discovers latent domains through adversarial feature extraction and k-means clustering, and then fuses domain-specific soft prompts using a domain-similarity-based weighting to improve cross-domain alignment. The method employs a two-stage prompt learning regime with losses , , and , and a total objective , while keeping the encoders frozen. Across four benchmarks, LDPF yields consistent gains over domain-label-free baselines and demonstrates competitive performance relative to methods that use explicit domain labels, though an upper-bound analysis reveals limitations of simple fusion in highly specialized domains. The findings highlight the practical potential of unlabeled latent-domain adaptation for robust vision–language understanding and point to future work on more expressive fusion or gating mechanisms to fully exploit promoter complementarity.

Abstract

The objective of domain generalization (DG) is to enable models to be robust against domain shift. DG is crucial for deploying vision-language models (VLMs) in real-world applications, yet most existing methods rely on domain labels that may not be available and often ambiguous. We instead study the DG setting where models must generalize well without access to explicit domain labels. Our key idea is to represent an unseen target domain as a combination of latent domains automatically discovered from training data, enabling the model to adaptively transfer knowledge across domains. To realize this, we perform latent domain clustering on image features and fuse domain-specific text features based on the similarity between the input image and each latent domain. Experiments on four benchmarks show that this strategy yields consistent gains over VLM-based baselines and provides new insights into improving robustness under domain shift.

Paper Structure

This paper contains 13 sections, 10 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the proposed framework. During training, the image and text encoders are frozen, and the Latent Domain Model assigns latent domain labels to guide the learning of domain-specific prompts. At inference, the model estimates the similarity between the input image and each latent domain, and fuses text features accordingly for better modality alignment, i.e., increasing the cosine similarity between image feature and the corresponding positive text feature.
  • Figure 2: Oracle upper bound $U_{\text{sel}}$ on two datasets.