Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models

Biao Chen, Lin Zuo, Mengmeng Jing, Kunbin He, Yuchen Wang

TL;DR

The paper addresses overfitting in vision-language prompt learning by introducing Dropout Prompt Learning, which applies token-level dropout to both visual and textual branches. It introduces Importance Weighted Token Dropout (IWTD) guided by a multimodal importance metric and couples dropout with Residual Entropy Regularization to preserve cross-modal alignment while fostering diverse representations. The approach is validated across 15 benchmarks, showing improved base-to-novel generalization, few-shot performance, and out-of-distribution robustness, with ablations confirming the critical role of cross-modal attention signals and residual entropy. The method maintains competitive computational efficiency and demonstrates broad applicability across architectures and adapters, highlighting its practical impact for robust, data-efficient VLM adaptation.

Abstract

Dropout is a widely used regularization technique that improves a model's generalization ability by randomly dropping neurons. Building on this idea, we propose Dropout Prompt Learning, which applies dropout to improve the robustness of vision-language models. Unlike vanilla dropout, we apply dropout to the tokens of the textual and visual branches, evaluating each token's significance from both intra-modal context and inter-modal alignment, which enables a flexible dropout probability for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios such as low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods on base-to-novel generalization, outperforming KgCoOp by 5.10% and PromptSRC by 2.13%.
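
The idea of per-token dropout probabilities can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `keep_probabilities` and `drop_tokens`, the `min_keep` parameter, and the min-max mapping from importance scores to retention probabilities are all assumptions made for illustration, and the scores are taken as given rather than computed from attention.

```python
import random


def keep_probabilities(scores, min_keep=0.5):
    """Map per-token importance scores to retention probabilities.

    Assumption (not from the paper): scores are min-max normalized, so the
    least important token is kept with probability `min_keep` and the most
    important token is always kept.
    """
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # equal scores degenerate to uniform (vanilla) dropout
    return [min_keep + (1.0 - min_keep) * (s - lo) / span for s in scores]


def drop_tokens(tokens, scores, min_keep=0.5, rng=None):
    """Keep each token independently with its own retention probability."""
    rng = rng or random.Random()
    probs = keep_probabilities(scores, min_keep)
    return [tok for tok, p in zip(tokens, probs) if rng.random() < p]


if __name__ == "__main__":
    # Hypothetical visual tokens with hypothetical importance scores.
    tokens = ["sky", "grass", "dog_head", "dog_body"]
    scores = [0.05, 0.10, 0.90, 0.70]
    print(keep_probabilities(scores))
    print(drop_tokens(tokens, scores, rng=random.Random(0)))
```

With this mapping, tokens that matter for image-text alignment are rarely dropped, while low-importance tokens are dropped at close to the vanilla rate of `1 - min_keep`.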

Paper Structure

This paper contains 33 sections, 1 theorem, 27 equations, 7 figures, 9 tables.

Key Result

Proposition 1

Given a training sample $\mathcal{X}$ of $n$ instances, let $\mathbb{F}_{\mathbf{q}}$ be the hypothesis space induced by layer-wise token retention probabilities $\mathbf{q}$, and $h \in \mathbb{F}_{\mathbf{q}}$ be a learned hypothesis. The expected risk $R(h)$ is bounded, with probability at least where $\hat{R}_{\mathcal{X}}(h)$ denotes the empirical risk, $\ell$ represents the $C_{\ell}$-Lipsc

Figures (7)

  • Figure 1: (a) Vanilla dropout randomly removes visual tokens, disrupting image-text alignment (top). Importance Weighted Token Dropout preserves semantically relevant tokens for alignment (bottom). (b) Comparison on base-to-novel, out-of-distribution generalization and few-shot image classification.
  • Figure 2: Method overview of Importance Weighted Token Dropout. Textual and visual modalities are processed by parallel encoding pathways: a frozen branch and a learnable branch. In the learnable branch, we compute an intra-/inter-modal importance metric for tokens at each layer, which guides adaptive token dropout. Residual features are then derived from the differences between the learnable and frozen branches. Finally, entropy maximization on both the visual and textual residuals constrains the dropout.
  • Figure 3: Multimodal Importance Metric, which simultaneously considers both intra-modal attention $S_{self}$, $S_{cls}$ and inter-modal attention $S_{cross}$.
  • Figure 4: (a) Base-to-novel task with imbalance ratio 10. (b) Few-shot classification. We conduct experiments on 11 datasets. Per-dataset results are in Appendix B.2.
  • Figure 5: (a) Grad-CAM visualizations for different dropout methods. Redder colors indicate higher feature attention. (b) Computational cost and performance of different prompt learning methods.
  • ...and 2 more figures
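
The residual and entropy steps described in the Figure 2 caption can be read as the following sketch. The page does not give the exact formulation, so everything here is an assumption for illustration: the residual is taken as the element-wise difference between learnable-branch and frozen-branch features, it is turned into a distribution with a softmax, and the regularizer is the negative Shannon entropy of that distribution, so that minimizing the loss maximizes the entropy. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def residual_entropy(learnable, frozen):
    """Shannon entropy of the softmax-normalized residual.

    Assumption: residual = learnable-branch features minus frozen-branch
    features, element by element.
    """
    residual = [a - b for a, b in zip(learnable, frozen)]
    probs = softmax(residual)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularizer(learnable, frozen):
    """Loss term: minimizing it maximizes the residual entropy."""
    return -residual_entropy(learnable, frozen)


if __name__ == "__main__":
    frozen = [0.2, -0.1, 0.4, 0.0]       # hypothetical frozen-branch features
    learnable = [0.3, -0.1, 1.5, 0.1]    # hypothetical learnable-branch features
    print(residual_entropy(learnable, frozen))
    print(entropy_regularizer(learnable, frozen))
```

Under this reading, a residual concentrated on a few feature dimensions has low entropy and is penalized, while a residual spread evenly across dimensions reaches the maximum entropy of log(d) for d dimensions.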

Theorems & Definitions (2)

  • Proposition 1
  • proof