DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models

Niloufar Alipour Talemi, Hossein Kashiani, Hossein R. Nowdeh, Fatemeh Afghah

TL;DR

DiSa tackles the generalization gap in prompt learning for vision-language models by introducing Cross-Interactive Regularization (CIR) and Directional Regularization (DiR). CIR fosters cross-modal interaction between prompted and frozen encoders and employs saliency-aware masking to emphasize semantically important image regions, while DiR aligns prompted features with class-wise prototypes derived from the frozen model, focusing on directional alignment. The combined objective $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{SR}} + \mathcal{L}_{\text{CIR}} + \lambda \mathcal{L}_{\text{DiR}}$ enables robust generalization to novel classes and domains. In extensive experiments across 11 diverse benchmarks, DiSa consistently outperforms state-of-the-art prompt-learning methods in base-to-novel generalization, cross-dataset transfer, domain generalization, and few-shot settings, with modest training overhead.
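To make the combined objective concrete, here is a minimal sketch of how the directional term and the weighted sum might be computed. It assumes the directional loss takes a `1 - cosine similarity` form, which this summary does not confirm; the names `cosine_similarity`, `directional_loss`, `total_loss`, and `lam` are hypothetical, and the other three loss terms are treated as already-computed scalars.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def directional_loss(features, labels, prototypes):
    """Mean (1 - cosine) between each prompted feature and its class prototype.

    ASSUMPTION: the 1 - cosine form is illustrative, not the paper's
    confirmed definition of the DiR term.
    """
    per_sample = [
        1.0 - cosine_similarity(feat, prototypes[label])
        for feat, label in zip(features, labels)
    ]
    return sum(per_sample) / len(per_sample)


def total_loss(l_ce, l_sr, l_cir, l_dir, lam=1.0):
    """Weighted sum mirroring L_total = L_CE + L_SR + L_CIR + lambda * L_DiR."""
    return l_ce + l_sr + l_cir + lam * l_dir


# A feature pointing the same way as its prototype incurs zero directional
# loss even though its Euclidean distance to the prototype is non-zero.
prototypes = {0: [1.0, 0.0]}
aligned = directional_loss([[2.0, 0.0]], [0], prototypes)
orthogonal = directional_loss([[0.0, 3.0]], [0], prototypes)
```

Under this assumed form, only the angle to the prototype is penalized, which is one way to read "directional alignment" as opposed to strict proximity.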

Abstract

Prompt learning has emerged as a powerful paradigm for adapting vision-language models such as CLIP to downstream tasks. However, existing methods often overfit to seen data, leading to significant performance degradation when generalizing to novel classes or unseen domains. To address this limitation, we propose DiSa, a Directional Saliency-Aware Prompt Learning framework that integrates two complementary regularization strategies to enhance generalization. First, our Cross-Interactive Regularization (CIR) fosters cross-modal alignment by enabling cooperative learning between prompted and frozen encoders. Within CIR, a saliency-aware masking strategy guides the image encoder to prioritize semantically critical image regions, reducing reliance on less informative patches. Second, we introduce a directional regularization strategy that aligns visual embeddings with class-wise prototype features in a directional manner to prioritize consistency in feature orientation over strict proximity. This approach ensures robust generalization by leveraging stable prototype directions derived from class-mean statistics. Extensive evaluations on 11 diverse image classification benchmarks demonstrate that DiSa consistently outperforms state-of-the-art prompt learning methods across various settings, including base-to-novel generalization, cross-dataset transfer, domain generalization, and few-shot learning.
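The class-wise prototypes mentioned in the abstract are derived from class-mean statistics. A minimal sketch of that step, assuming features from the frozen encoder are available as plain lists of floats with integer class labels (the function name `class_prototypes` and the data layout are hypothetical):

```python
from collections import defaultdict


def class_prototypes(features, labels):
    """Return {label: mean feature vector} over all samples of that label."""
    grouped = defaultdict(list)
    for feat, label in zip(features, labels):
        grouped[label].append(feat)
    return {
        # zip(*feats) iterates over feature dimensions across the class samples.
        label: [sum(dim) / len(feats) for dim in zip(*feats)]
        for label, feats in grouped.items()
    }


# Toy frozen-encoder features: two samples of class 0, one of class 1.
frozen_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
frozen_labels = [0, 0, 1]
prototypes = class_prototypes(frozen_features, frozen_labels)
```

Because the prototypes come from the frozen model, they stay fixed during prompt tuning and can serve as stable reference directions.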

Paper Structure

This paper contains 17 sections, 9 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 2: Overview of the proposed DiSa. DiSa employs two complementary regularization approaches: saliency-aware cross-interactive regularization and directional regularization. The saliency-aware cross-interactive framework ensures that the prompted encoders establish independent, saliency-based interactions with the cross-modality outputs of the frozen encoders. Meanwhile, the directional regularization aligns prompted features with classification goals using class-wise feature means from the frozen model as robust prototypes. The saliency masking component evaluates the importance of image tokens by computing attention scores between image patch tokens and the $CLS$ token from the frozen model's text encoder. Distinct arrow styles illustrate the different data flows: dashed blue arrows indicate the path of the full (unmasked) image through the prompted vision encoder; solid pink arrows represent the saliency-masked image path; and solid black arrows denote flows without multiple input dependencies.
  • Figure 3: Performance comparison across K-shot settings (K = 1, 2, 4, 8, 16). Our approach consistently achieves superior average performance, with notable gains in low-shot scenarios, particularly for K=1, 2, 4.
  • Figure 4: Analysis of saliency masking and the directional regularization. (a) Accuracy vs. percentage of least informative patches masked, showing optimal performance for novel classes at 25–35% masking. (b) Random masking of 50% least informative patches improves novel-class accuracy, with stability for base classes at 25% masking. (c) Comparison of feature alignment strategies, highlighting directional alignment as most effective for improving generalization.
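The saliency masking described in the Figure 2 and Figure 4 captions can be sketched as follows. This is an illustration only: it assumes saliency is a softmax over scaled dot products between a reference token and the patch tokens, and that the lowest-scoring fraction of patches is dropped. The names `saliency_scores`, `keep_mask`, and `mask_ratio` are hypothetical.

```python
import math


def saliency_scores(cls_token, patch_tokens):
    """Softmax over scaled dot products between a reference token and each patch.

    ASSUMPTION: scaled dot-product attention is used here for illustration;
    the paper's exact scoring may differ.
    """
    scale = math.sqrt(len(cls_token))
    logits = [
        sum(c * p for c, p in zip(cls_token, patch)) / scale
        for patch in patch_tokens
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_mask(scores, mask_ratio):
    """Return booleans: False for the lowest-scoring fraction of patches."""
    num_masked = int(len(scores) * mask_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    masked = set(order[:num_masked])
    return [i not in masked for i in range(len(scores))]


# Mask the least salient 25% of four patches.
mask = keep_mask([0.9, 0.1, 0.5, 0.3], mask_ratio=0.25)
```

A `mask_ratio` in the 0.25 to 0.35 range corresponds to the setting that panel (a) of Figure 4 reports as best for novel classes.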