When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective

Hsi-Ai Tsao, Lei Hsiung, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

The paper proposes a log-likelihood ratio (LLR) approach for analyzing the comparative benefits of visual prompting and linear probing. The approach attains up to a 100-fold reduction in run time compared to full training while achieving prediction accuracies up to 91%.

Abstract

Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve performance on out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, can sometimes be the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient approximations of visual prompts, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%. The source code is available at https://github.com/IBM/VP-LLR.

Paper Structure (18 sections, 10 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: The PCC of Embeddings in ResNet18. $PCC_{X,Y}$ is computed from the embeddings $X$ and $Y$. Here, $X$ is obtained by inputting the entire prompted image, while $Y$ is obtained by inputting either (1) the visual prompts or (2) the clean image.
  • Figure 2: The Visual Prompting Framework.
  • Figure 3: The Similarity of Visual Prompts on CLIP (ViT-B/32). Various types of prompts are presented in the frequency domain, along with line plots of the average values with radii. The similarity between simulated and trained visual prompts is evaluated using KL divergence (Kullback and Leibler, 1951).
  • Figure 4: Accuracy and LLR on the Combined Datasets Using CLIP (ViT-B/32). SVHN is considered more OOD, while DTD tends to be ID. There are 10 classes in SVHN, and we gradually increase the number of classes in DTD from 2 to 45 to obtain different ID/OOD proportions.
  • Figure 5: The \ref{eq:LogME_VP} Scores with Prompts. The plot shows the \ref{eq:LogME_VP} scores obtained from CLIP (ViT-B/32) with various input prompts, including no prompts, Gaussian prompts, gradient prompts, mini-finetuning prompts, and well-trained prompts.
  • ...and 4 more figures
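
The embedding comparison described in the Figure 1 caption can be illustrated with a small sketch. It assumes that PCC denotes the Pearson correlation coefficient, which the caption does not state explicitly, and that each embedding has been flattened to a vector. The function name `pearson_cc` and the toy vectors are hypothetical and are not taken from the paper or its repository.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length vectors.

    Assumption: "PCC" in the caption means Pearson correlation, and
    x, y are flattened embedding vectors.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Center both vectors around their means.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    # Covariance numerator and product of the two centered norms.
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        raise ValueError("PCC is undefined for a constant vector")
    return cov / norm


# Hypothetical toy embeddings, standing in for model outputs.
emb_prompted = [0.2, 1.1, -0.4, 0.9]     # X: entire prompted image
emb_clean = [0.1, 1.0, -0.5, 1.2]        # Y, case (2): clean image
emb_prompt_only = [1.0, -0.3, 0.8, -0.6]  # Y, case (1): visual prompts only

print(pearson_cc(emb_prompted, emb_clean))
print(pearson_cc(emb_prompted, emb_prompt_only))
```

In practice the vectors would come from a real encoder such as ResNet18; the toy values here only show how the two cases in the caption would be compared.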