Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models

Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah

TL;DR

Style-Pro is a novel style-guided prompt learning framework that mitigates overfitting and preserves the zero-shot generalization capabilities of CLIP. It consistently surpasses state-of-the-art methods in base-to-new generalization, cross-dataset transfer, and domain generalization.

Abstract

Pre-trained vision-language (VL) models, such as CLIP, have shown significant generalization ability on downstream tasks, even with minimal fine-tuning. While prompt learning has emerged as an effective strategy for adapting pre-trained VL models to downstream tasks, current approaches frequently overfit severely to specific downstream data distributions. This overfitting limits the original ability of VL models to generalize to new domains or unseen classes, posing a critical challenge for their adaptability and generalization. To address this limitation, we propose Style-Pro, a novel style-guided prompt learning framework that mitigates overfitting and preserves the zero-shot generalization capabilities of CLIP. Style-Pro employs learnable style bases to synthesize diverse distribution shifts, guided by two specialized loss functions that ensure style diversity and content integrity. Then, to minimize discrepancies between unseen domains and the source domain, Style-Pro maps unseen styles into the known style representation space as a weighted combination of style bases. Moreover, to maintain consistency between the style-shifted prompted model and the original frozen CLIP, Style-Pro introduces consistency constraints that preserve alignment in the learned embeddings, minimizing deviation during adaptation to downstream tasks. Extensive experiments across 11 benchmark datasets demonstrate the effectiveness of Style-Pro, which consistently surpasses state-of-the-art methods in base-to-new generalization, cross-dataset transfer, and domain generalization.
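The idea of mapping an unseen style into the space spanned by the style bases can be illustrated with a small sketch. It assumes, as is common in style-transfer work, that a "style" is the per-channel mean and standard deviation of a feature map; the abstract does not spell this out. The function names (`channel_stats`, `project_style`, `restyle`), the distance-based softmax weighting, and the `temperature` parameter are all hypothetical choices made for illustration, not the authors' formulation.

```python
import math


def channel_stats(feature, eps=1e-6):
    """Return (mean, std) of one channel, given as a flat list of activations."""
    mean = sum(feature) / len(feature)
    var = sum((v - mean) ** 2 for v in feature) / len(feature)
    return mean, math.sqrt(var + eps)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_style(style, bases, temperature=1.0):
    """Express a (mean, std) style as a convex combination of style bases.

    ASSUMPTION: weights are a softmax over negative squared distances to
    each base. The paper may compute its weights differently.
    """
    scores = [
        -((style[0] - b[0]) ** 2 + (style[1] - b[1]) ** 2) / temperature
        for b in bases
    ]
    weights = softmax(scores)
    mean = sum(w * b[0] for w, b in zip(weights, bases))
    std = sum(w * b[1] for w, b in zip(weights, bases))
    return weights, (mean, std)


def restyle(feature, new_style, eps=1e-6):
    """Normalize a channel, then re-scale and shift it to the new (mean, std)."""
    mean, std = channel_stats(feature, eps)
    return [(v - mean) / std * new_style[1] + new_style[0] for v in feature]


# Illustrative values only; in the paper the bases are learned.
bases = [(0.0, 1.0), (1.0, 2.0), (5.0, 0.5)]
unseen_feature = [1.0, 2.0, 3.0, 4.0]
weights, projected = project_style(channel_stats(unseen_feature), bases)
restyled_feature = restyle(unseen_feature, projected)
```

Because the weights are non-negative and sum to one, the projected style always lies inside the region covered by the bases, which is one way to read the claim that the mapping reduces the discrepancy between unseen and known styles.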

Paper Structure

This paper contains 15 sections, 12 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of the proposed style shift learning approach. Simply mixing feature statistics [zhou2021domain, wang2022feature] from the source domain does not generate styles sufficiently distinct from the source domain. Style-Pro addresses this by employing a learnable set of style bases to synthesize style bases beyond the source domain. Furthermore, mapping unseen styles into the style representation space as a weighted combination of these style bases reduces the discrepancy between unseen domains and the source domain.
  • Figure 2: Overview of the proposed Style-Pro framework. Style-Pro introduces a style-guided prompt learning framework, incorporating a novel style shift learning approach in the feature space through a learnable set of style bases. Unseen styles are mapped into the style representation space as a weighted combination of style bases, reducing style discrepancies and improving performance on OOD data. Furthermore, Style-Pro ensures consistency between the embeddings of the prompted and frozen models during adaptation, which facilitates fine-tuning of CLIP while preserving its generalization capabilities.
  • Figure 3: (a) Ablation study on style shift learning at different layers of the vision encoder. (b) Ablation study on the impact of the number of learnable style bases.
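The limitation noted in the Figure 1 caption, that mixing source-domain feature statistics cannot produce styles far from the source, follows from simple arithmetic: a convex mix of two values never leaves the interval between them. The sketch below illustrates this with a hypothetical `mix_styles` helper and made-up (mean, std) pairs; it is not taken from the cited methods' code.

```python
def mix_styles(style_a, style_b, lam):
    """Convex mix of two (mean, std) pairs with mixing weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mean = lam * style_a[0] + (1.0 - lam) * style_b[0]
    std = lam * style_a[1] + (1.0 - lam) * style_b[1]
    return mean, std


# Two illustrative source-domain styles (values are made up).
source_a = (0.2, 1.0)
source_b = (0.8, 1.5)

# Sweep the mixing weight; every result stays between the two sources.
mixed = [mix_styles(source_a, source_b, k / 10) for k in range(11)]
mean_span = (min(m for m, _ in mixed), max(m for m, _ in mixed))
std_span = (min(s for _, s in mixed), max(s for _, s in mixed))
```

However the mixing weight is chosen, the mixed statistics stay within the span of the source statistics, which is why a learnable set of style bases is needed to reach styles beyond the source domain.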