
Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning

Kun Ding, Haojian Zhang, Qiang Yu, Ying Wang, Shiming Xiang, Chunhong Pan

TL;DR

The paper tackles poor base-to-novel generalization in vision-language prompt tuning by introducing a test-time, OOD-driven fusion of zero-shot and few-shot classifiers. It defines a competition-based scoring function $s(x)$ that blends the two classifiers dynamically, using per-sample OOD cues to bias toward the base- or novel-distribution classifier without retraining. Empirical results across 11 base-to-novel datasets and domain-generalization benchmarks show consistent harmonic-mean improvements, with state-of-the-art gains when combining complementary VLPT methods. The approach demonstrates that even weak OOD detectors can meaningfully enhance VLPT generalization and provides a practical, plug-in mechanism for existing models.

Abstract

We propose a general method for improving the generalization ability of pre-trained vision-language models (VLMs) when they are fine-tuned on downstream few-shot tasks. The idea is to use out-of-distribution (OOD) detection to predict whether a sample belongs to the base distribution or a novel distribution, and then to use the score produced by a dedicated competition-based scoring function to fuse the zero-shot and few-shot classifiers. The fused classifier is dynamic: it biases towards the zero-shot classifier if a sample is more likely to come from the distribution the model was pre-trained on, which improves base-to-novel generalization. Our method is applied only at the test stage, so it can boost existing methods without time-consuming re-training. Extensive experiments show that even weak distribution detectors can improve VLMs' generalization ability. Specifically, with the help of OOD detectors, the harmonic means of CoOp and ProGrad increase by 2.6 and 1.5 percentage points, respectively, over 11 recognition datasets in the base-to-novel setting.
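
The per-sample fusion described in the abstract can be illustrated with a minimal sketch. This is not the paper's scoring function, which is not reproduced in this summary. As stated assumptions, the sketch uses the maximum softmax probability of each classifier as a weak in-distribution cue, lets the two cues compete through a sigmoid with sharpness `alpha`, and interpolates the logits with the resulting weight. The function names (`softmax`, `fusion_weight`, `fuse_logits`) and the direction of the bias are hypothetical. With `alpha` set to infinity the sigmoid becomes a Heaviside step, the limiting case referred to in Proposition 1 and Figure 1.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fusion_weight(id_score_fs, id_score_zs, alpha=10.0):
    """Weight in [0, 1] given to the few-shot classifier for each sample.

    ASSUMPTION: the two in-distribution scores compete through a sigmoid;
    alpha = inf degenerates to a hard (Heaviside) switch.
    """
    diff = np.asarray(id_score_fs, dtype=float) - np.asarray(id_score_zs, dtype=float)
    if np.isinf(alpha):
        return (diff > 0).astype(float)
    return 1.0 / (1.0 + np.exp(-alpha * diff))


def fuse_logits(zs_logits, fs_logits, alpha=10.0):
    """Blend zero-shot and few-shot logits sample by sample.

    ASSUMPTION: the max softmax probability serves as the weak
    in-distribution cue for each classifier.
    """
    zs_logits = np.asarray(zs_logits, dtype=float)
    fs_logits = np.asarray(fs_logits, dtype=float)
    s = fusion_weight(
        softmax(fs_logits).max(axis=-1),
        softmax(zs_logits).max(axis=-1),
        alpha,
    )
    # Broadcast the per-sample weight over the class dimension.
    return s[:, None] * fs_logits + (1.0 - s[:, None]) * zs_logits


# Hypothetical logits for two samples and three classes.
zs = np.array([[2.0, 0.0, 0.0], [0.0, 0.1, 0.0]])
fs = np.array([[0.0, 0.1, 0.0], [0.0, 3.0, 0.0]])
hard = fuse_logits(zs, fs, alpha=np.inf)  # each sample uses one classifier only
soft = fuse_logits(zs, fs, alpha=0.0)     # plain average of the two classifiers
```

Because the weight is computed per sample at inference time, a sketch like this leaves both classifiers untouched, which matches the abstract's claim that no re-training is needed.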

Paper Structure (14 sections, 1 theorem, 9 equations, 3 figures, 5 tables)


Key Result

Proposition 1

The harmonic mean accuracy $H$ of the fused classifier $\mathbf{W}$, with the Heaviside step function used on the base and novel sets, can be represented as:

Figures (3)

  • Figure 1: The relation between the harmonic mean and the ID prediction accuracies when $\alpha=\infty$ (in the score-function equation). $r_b$ and $1-r_n$ are the in-distribution prediction accuracies for samples in the base and novel sets, respectively. The maximum harmonic mean $H_{max}$ is attained when $r_b=1$ and $r_n=0$. $H_{fs}$ and $H_{zs}$ denote the harmonic means obtained by applying the few-shot and the zero-shot classifier, respectively, to all test samples.
  • Figure 2: The framework of the proposed method.
  • Figure 3: Effect of $\alpha$ on accuracies in CoOp+.
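
Proposition 1 and Figure 1 are stated in terms of the harmonic mean of base and novel accuracy, the standard metric in the base-to-novel setting. The short sketch below computes it; the function name `harmonic_mean` and the accuracy values are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean H = 2 * B * N / (B + N) of base and novel accuracy."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0  # both accuracies are zero, so H is defined as zero here
    return 2.0 * base_acc * novel_acc / total


# Hypothetical accuracies in percent, for illustration only.
balanced = harmonic_mean(70.0, 70.0)    # 70.0
imbalanced = harmonic_mean(90.0, 50.0)  # about 64.29, below the arithmetic mean of 70
```

The harmonic mean penalizes imbalance between the two accuracies, which is why a classifier that gains on base classes at the expense of novel classes can still score lower overall.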

Theorems & Definitions (2)

  • Proposition 1
  • proof