Transitive Vision-Language Prompt Learning for Domain Generalization

Liyuan Wang, Yan Jin, Zhen Chen, Jinlin Wu, Mengke Li, Yang Lu, Hanzi Wang

TL;DR

Transitive Vision-Language Prompt Learning (TPL) addresses domain generalization by decoupling domain invariance and class separability into vision and language prompts, respectively. It introduces a transitive learning strategy and adaptive fusion to maintain CLIP alignment while promoting domain-robust representations. Empirical results on PACS, VLCS, and OfficeHome show state-of-the-art performance and improved inter-domain invariance as well as intra-domain separability. The work demonstrates the value of leveraging cross-modal prompts and adaptive balance for robust DG across diverse environments.

Abstract

Vision-language pre-training has enabled deep models to take a large step forward in generalizing to unseen domains. Recent learning methods built on vision-language pre-trained models are effective tools for domain generalization (DG) and address the problem to a large extent. However, these advances still suffer from a trade-off between domain invariance and class separability, both of which are crucial in current DG problems. In this paper, we introduce a novel prompt learning strategy that leverages deep vision prompts to address domain invariance while utilizing language prompts to ensure class separability, coupled with adaptive weighting mechanisms to balance the two. Extensive experiments demonstrate that deep vision prompts effectively extract domain-invariant features, significantly improving the generalization ability of deep models and achieving state-of-the-art performance on three datasets.

Paper Structure

This paper contains 25 sections, 10 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: 2D feature space visualization on the PACS dataset by (a) jointly training vision and language prompts and (b) our proposed learning strategy. Shapes represent domains, and colors denote classes. The ellipses illustrate the degree of domain invariance by fitting a Gaussian model.
  • Figure 2: Overview of the proposed TPL. Vision prompts are trained to enhance domain-invariant features. They are then fed to the domain prompt generator, which produces domain-specific prompts for the text encoder. Finally, the transitive learning strategy combines these two components to balance domain invariance and class separability throughout the tuning process.
  • Figure 3: (a) The average distance between any two domains in each class on the VLCS dataset demonstrates inter-class domain invariance. Different colors indicate different classes. (b) The average distance between any two classes in each domain on the PACS dataset demonstrates intra-domain class separability. Different colors indicate different domains.
  • Figure 4: (a) Average inter-domain distance on the VLCS dataset. (b) Domain-invariant weights on the VLCS dataset.
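
The distance statistics described in the Figure 3 and Figure 4 captions can be illustrated with a small sketch. The captions do not say which distance the authors use or how features are aggregated, so the code below makes two assumptions: each (class, domain) group is summarized by its mean feature vector, and distances are Euclidean. The function name `mean_inter_domain_distance` and the toy data are hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import dist
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [fmean(component) for component in zip(*vectors)]


def mean_inter_domain_distance(features, labels, domains):
    """For each class, average the Euclidean distance between the
    centroids of every pair of domains that contain that class.

    Assumption: centroid-to-centroid Euclidean distance; the paper's
    exact definition may differ.
    """
    # Group feature vectors by class, then by domain.
    groups = {}
    for vector, label, domain in zip(features, labels, domains):
        groups.setdefault(label, {}).setdefault(domain, []).append(vector)

    result = {}
    for label, per_domain in groups.items():
        centroids = [centroid(vs) for vs in per_domain.values()]
        pair_distances = [dist(a, b) for a, b in combinations(centroids, 2)]
        # A class seen in fewer than two domains has no pairs to compare.
        if pair_distances:
            result[label] = fmean(pair_distances)
    return result


# Toy 2D features: "cat" overlaps across domains, "dog" does not.
features = [[0.0, 0.0], [0.0, 2.0], [3.0, 4.0], [3.0, 6.0], [1.0, 1.0], [1.0, 1.0]]
labels = ["dog", "dog", "dog", "dog", "cat", "cat"]
domains = ["photo", "photo", "sketch", "sketch", "photo", "sketch"]

scores = mean_inter_domain_distance(features, labels, domains)
print(scores)
```

Under these assumptions, a smaller value for a class means its features sit closer together across domains, which is how domain invariance is read in this sketch. Passing the arguments in swapped roles, `mean_inter_domain_distance(features, domains, labels)`, gives the average distance between class centroids within each domain, the analogue of the class-separability view in Figure 3(b), where larger values indicate better-separated classes.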