HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization

Guanglin Zhou, Zhongyi Han, Shiming Chen, Biwei Huang, Liming Zhu, Tongliang Liu, Lina Yao, Kun Zhang

TL;DR

This work tackles domain generalization by addressing the tendency of fixed-parameter models to conflate invariant and domain-specific features. It introduces Hierarchical Contrastive Visual Prompt (HCVP), which generates domain-level and task-specific prompts through a two-tier Hierarchical Prompt Generation Network and injects them into a ViT backbone via a Prompt Modulation Network. Two contrastive losses, Prompt Contrastive Learning (PCL) and Class-conditional Contrastive Invariance (CCI), align prompts with domain and class structure while preserving cross-domain invariance, guided by mutual information objectives. Across five DG benchmarks, HCVP achieves state-of-the-art average performance and demonstrates robustness under diverse distribution shifts, with ablations confirming the value of each loss component. The approach offers a scalable, end-to-end framework for enriching pretrained visual models with structured, instance-dependent prompts to improve domain generalization and suggests potential for cross-modal extensions.

Abstract

Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features. In DG, the prevalent practice of constraining models to a fixed structure or uniform parameterization to encapsulate invariant features can inadvertently blend in domain-specific aspects. Such an approach struggles to differentiate inter-domain variations and may exhibit bias towards certain domains, hindering the precise learning of domain-invariant features. Recognizing this, we introduce a novel method designed to supplement the model with domain-level and task-specific characteristics. This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting generalization. Building on the emerging trend of visual prompts in the DG paradigm, our work introduces the novel Hierarchical Contrastive Visual Prompt (HCVP) methodology. It sets itself apart with a generative approach to prompts, alongside an explicit model structure and specialized loss functions. Unlike traditional visual prompts, which are often shared across entire datasets, HCVP utilizes a hierarchical prompt generation network enhanced by prompt contrastive learning. These generative prompts are instance-dependent, catering to the unique characteristics inherent to different domains and tasks. Additionally, we devise a prompt modulation network that serves as a bridge, effectively incorporating the generated visual prompts into the vision transformer backbone. Experiments conducted on five DG datasets demonstrate the effectiveness of HCVP, which outperforms both established DG algorithms and adaptation protocols.
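
The abstract names prompt contrastive learning but does not give its formula here. As a rough illustration only, the sketch below assumes a generic supervised contrastive (InfoNCE-style) objective, in which prompt vectors that share a label (for example, the same domain) are pulled together and the rest are pushed apart. The function name, the temperature value, and the toy vectors are hypothetical and are not taken from the paper; the actual PCL loss may differ.

```python
import math


def _normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supervised_contrastive_loss(vectors, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumption, not the paper's PCL).

    For each anchor, every other vector with the same label is a positive;
    all remaining vectors act as negatives in the softmax denominator.
    """
    z = [_normalize(v) for v in vectors]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other vector.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor vectors.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy prompt vectors: the first two point one way, the last two another.
prompts = [[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0], [0.0, 0.9]]
aligned = supervised_contrastive_loss(
    prompts, ["photo", "photo", "sketch", "sketch"]
)
shuffled = supervised_contrastive_loss(
    prompts, ["photo", "sketch", "photo", "sketch"]
)
print(aligned, shuffled)
```

When the labels agree with the geometry of the prompts, the loss is small; when they are shuffled, it grows. That is the behavior a contrastive objective over prompts relies on.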

Paper Structure (32 sections, 12 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 32 sections, 12 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation illustration. (a) Traditional DG methods, employing universal parameters across the entire dataset, often struggle to distinguish between invariant shape attributes (e.g., large ears, long nose) and domain-specific texture attributes (e.g., thick grey skin). This difficulty leads to a blending of features, thereby diminishing the model's capacity for generalization. (b) Our approach introduces hierarchical visual prompts that encapsulate domain-level and task-specific characteristics, enabling the model to better differentiate and understand both invariant and specific attributes, thereby contributing to more effective generalization across different domains.
  • Figure 2: The architecture overview of the proposed Hierarchical Contrastive Visual Prompt (HCVP) model. HCVP comprises two key components: the Hierarchical Prompt Generation Network (HPGN) and the Prompt Modulation Network (PMN). The HPGN first uses a pretrained encoder to extract feature maps. These feature maps are then processed through a dual-level generation module for domain-level and task-specific prompt generation, respectively. The PMN serves as a conduit, integrating the prompts generated by the HPGN into the ViT layers. Additionally, HCVP incorporates two contrastive learning strategies: Prompt Contrastive Learning (PCL), which optimizes the generation of both domain and task-specific visual prompts, and Class-conditional Contrastive Invariance (CCI), which enhances the model's class-specific discriminative power.
  • Figure 3: Comparison of inter-domain feature distances for ERM and HCVP across multiple datasets. It illustrates the effectiveness of HCVP in achieving lower inter-domain distances compared to ERM, suggesting a stronger capability for domain-invariant feature learning.
  • Figure 4: The t-SNE visualizations of visual features in our HCVP, for the last unseen domain on PACS, VLCS, OfficeHome, CelebA, and TerraIncognita datasets. Forty instances are sampled within each class. Additionally, we select the first ten classes in the OfficeHome dataset.
  • Figure 5: The t-SNE visualizations of domain-level and task-specific prompts on the PACS dataset, representing four domains and seven class labels. Thirty instances are sampled within each class for visualization.
  • ...and 1 more figure
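
The Figure 3 caption compares inter-domain feature distances but does not state how the distance is computed. A minimal sketch, assuming the metric is the mean Euclidean distance between per-domain feature centroids, is given below. The helper names and toy features are hypothetical, and the paper may use a different measure.

```python
import math
from itertools import combinations


def domain_centroids(features, domains):
    # Average the feature vectors that belong to each domain.
    sums, counts = {}, {}
    for f, d in zip(features, domains):
        if d not in sums:
            sums[d] = [0.0] * len(f)
            counts[d] = 0
        sums[d] = [s + x for s, x in zip(sums[d], f)]
        counts[d] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}


def mean_inter_domain_distance(features, domains):
    """Mean Euclidean distance over all pairs of domain centroids.

    Assumed metric for illustration; lower values suggest features that
    vary less across domains.
    """
    centroids = domain_centroids(features, domains)
    pairs = list(combinations(sorted(centroids), 2))
    if not pairs:
        return 0.0  # fewer than two domains: no pair to compare
    return sum(math.dist(centroids[a], centroids[b]) for a, b in pairs) / len(pairs)


domains = ["photo", "photo", "sketch", "sketch"]
# Features whose centroids differ strongly between the two domains.
spread = mean_inter_domain_distance(
    [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]], domains
)
# Features whose centroids lie closer together across domains.
compact = mean_inter_domain_distance(
    [[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]], domains
)
print(spread, compact)
```

Under this assumed metric, a model that learns more domain-invariant features would produce the smaller of the two values.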