When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective

Hsi-Ai Tsao; Lei Hsiung; Pin-Yu Chen; Tsung-Yi Ho

TL;DR

We propose a log-likelihood ratio (LLR) approach for analyzing the comparative benefits of visual prompting and linear probing. It attains up to a 100-fold reduction in run time compared to full training while achieving prediction accuracies of up to 91%.

Abstract

Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve performance on out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, is sometimes the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient visual prompt approximations, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies of up to 91%. The source code is available at https://github.com/IBM/VP-LLR.
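
As a generic illustration only (the paper's exact definition belongs to its Methodology section, which is not reproduced here), a log-likelihood ratio compares how well two hypotheses explain the same data. The symbols $\mathcal{L}_{\mathrm{VP}}$ and $\mathcal{L}_{\mathrm{LP}}$ below are hypothetical placeholders for the likelihood (or evidence) of the target data under visual prompting and linear probing, and the sign convention is an assumption:

$$
\mathrm{LLR} \;=\; \log \frac{\mathcal{L}_{\mathrm{VP}}}{\mathcal{L}_{\mathrm{LP}}} \;=\; \log \mathcal{L}_{\mathrm{VP}} \;-\; \log \mathcal{L}_{\mathrm{LP}}
$$

Under this assumed convention, a positive value would favor visual prompting and a negative value would favor linear probing.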

Paper Structure (18 sections, 10 equations, 9 figures, 4 tables)

Introduction
Background and Related Work
Visual Prompting
LogME for Model Selection
Methodology
Log-Likelihood Ratio
LogME Evidence and Visual Prompting Evidence
Visual Prompt Approximation
Experimental Results
The Effectiveness of LLR and Simulated Prompts
The Sorting Results with Diverse Datasets
Conclusion
Datasets and Pre-Trained Models
VP and LP Performance
Feature Extraction for Evidence Score Calculation
...and 3 more sections

Figures (9)

Figure 1: The PCC of Embeddings in ResNet18. The $PCC_{X,Y}$ is calculated by embeddings $X$ and $Y$. Here, $X$ is obtained by inputting the entire prompted image, while $Y$ is obtained by inputting either (1) the visual prompts or (2) the clean image.
Figure 2: The Visual Prompting Framework.
Figure 3: The Similarity of Visual Prompts on CLIP (ViT-B/32). Various types of prompts are presented in the frequency domain, along with line plots of the average values with radii. The similarity between simulated and trained visual prompts is evaluated using KL divergence (Kullback and Leibler, 1951).
Figure 4: Accuracy and LLR on the Combined Datasets Using CLIP (ViT-B/32). SVHN is considered more OOD, while DTD tends to be ID. There are 10 classes in SVHN, and we gradually increase the number of classes in DTD from 2 to 45 to obtain different ID/OOD proportions.
Figure 5: The \ref{eq:LogME_VP} Scores with Prompts. The plot shows the \ref{eq:LogME_VP} scores obtained from CLIP (ViT-B/32) with various input prompts, including without prompts, Gaussian prompts, gradient prompts, mini-finetuning prompts, and well-trained prompts.
...and 4 more figures
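
The two similarity measures named in the captions above, the Pearson correlation coefficient (Figure 1) and the KL divergence (Figure 3), are standard quantities. The following is a minimal sketch of both, assuming embeddings and frequency profiles are already flattened into plain lists of floats. The function names `pearson_cc` and `kl_divergence` and the smoothing constant `eps` are hypothetical and are not taken from the VP-LLR repository.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("PCC is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two non-negative profiles of equal length.

    Both inputs are smoothed by eps (an assumed choice to avoid log(0))
    and then normalized so that each sums to 1.
    """
    if len(p) != len(q) or len(p) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    p = [v + eps for v in p]
    q = [v + eps for v in q]
    sum_p, sum_q = sum(p), sum(q)
    return sum((a / sum_p) * math.log((a / sum_p) / (b / sum_q))
               for a, b in zip(p, q))
```

In this sketch, a PCC near 1 would indicate that two embeddings vary together almost linearly, and a KL divergence near 0 would indicate that a simulated prompt's profile closely matches that of a trained prompt. Note that KL divergence is not symmetric in its arguments.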
