A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem

Kun Ding, Ying Wang, Gaofeng Meng, Shiming Xiang

TL;DR

This survey proposes a unified computational framework from the perspective of the Representer Theorem, derives many existing methods by specializing it, and extends them by exploiting the closed-form solution of kernel ridge regression.

Abstract

The advent of pre-trained vision-language foundation models has revolutionized the field of zero/few-shot (i.e., low-shot) image recognition. The key challenge under limited training data is how to fine-tune pre-trained vision-language models in a parameter-efficient manner. Numerous approaches tackling this challenge have been proposed, and a few survey papers have been published to summarize them. However, a unified computational framework that integrates existing methods, identifies their nature and supports in-depth comparison is still lacking. This survey therefore first proposes a unified computational framework from the perspective of the Representer Theorem and then derives many of the existing methods by specializing this framework. Thereafter, a comparative analysis is conducted to uncover the differences and relationships between existing methods. Based on these analyses, some possible variants that improve existing works are presented. As a demonstration, we extend existing methods by modeling inter-class correlation between representers in reproducing kernel Hilbert space (RKHS), which is implemented by exploiting the closed-form solution of kernel ridge regression. Extensive experiments on 11 datasets are conducted to validate the effectiveness of this method. Toward the end of the paper, we discuss the limitations and provide further research directions.

Paper Structure

This paper contains 39 sections, 1 theorem, 59 equations, 5 figures, 4 tables.

Key Result

Theorem 1

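Given $n$ training samples $(X_1,Y_1),\cdots, (X_n, Y_n)$, consider the optimization problem in an RKHS $\mathcal{H}$ with associated kernel $k$:

$$\min_{f\in\mathcal{H}} \; L\big((X_1,Y_1,f(X_1)),\cdots,(X_n,Y_n,f(X_n))\big) + \psi\big(\|f\|_{\mathcal{H}}\big),$$

where $L$ is a loss function that depends on the $n$ examples and $\psi:[0,\infty)\rightarrow\mathbb{R}$ is a strictly increasing function. All minimizers $f^*$ of this problem admit the following form, for some coefficients $\alpha_1,\cdots,\alpha_n$:

$$f^*(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(\cdot, X_i).$$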
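
To make the form guaranteed by the theorem concrete, the following is a minimal sketch of kernel ridge regression, for which the coefficients have the closed form $\alpha = (K + \lambda I)^{-1} Y$ with $K_{ij} = k(X_i, X_j)$. It assumes the squared loss, $\psi(t) = \lambda t^2$ and a Gaussian kernel; the function names, the toy data and the hyperparameter values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, Y, lam=1e-3, gamma=1.0):
    """Closed-form coefficients: solve (K + lam * I) alpha = Y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    """Evaluate f*(x) = sum_i alpha_i * k(x, X_i) for each row of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy data (assumed): 20 samples with 4-dim features and 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
labels = rng.integers(0, 3, size=20)
Y = np.eye(3)[labels]                 # one-hot targets, shape (20, 3)

alpha = krr_fit(X, Y)                 # one coefficient row per training sample
scores = krr_predict(X, alpha, X)     # class scores, shape (20, 3)
```

Prediction only needs kernel evaluations between the new input and the $n$ training samples, which keeps the cost low when $n$ is small, as in the low-shot setting.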

Figures (5)

  • Figure 1: Overview of CLIP (modified from CLIP). In this figure, $F_1, \cdots, F_n$ are the image features extracted by the image encoder, $T_1, \cdots, T_n$ are the text features extracted by the text encoder.
  • Figure 2: The proposed framework. For training-free methods, the loss function is not computed. $\theta_\text{a}, \theta_\text{im}, \theta_\text{ker}, \theta_\text{log}$ denote the learnable parameters in anchor computation, input image encoding, kernel computation and logit computation, respectively. Other notations: $I$ the input image, $\{I_i\}_{i=1}^N$ the labeled and unlabeled training images, $\mathcal{C}$ the class names, $\mathcal{A}$ the anchors, $X$ the input image's features, $k(X)$ the kernel vector of $X$, $V(\theta_\text{log})$ the transformation matrix, $L$ the loss function, $R(\theta)$ the regularization term, $\lambda$ the regularization weight and $Y$ the ground-truth label.
  • Figure 3: Flowchart of LP-CLIP, Zero-shot CLIP and CoOp. In (a), $A_1,\cdots,A_C$ denote the learnable classifier weights; in (b) and (c), they are text features extracted by the text encoder, which also serve as the classifier weights. $X$ denotes the image features extracted by the image encoder.
  • Figure 4: Flowchart of Tip-Adapter and Tip-Adapter-F. For Tip-Adapter, the keys (i.e., the training image features) are frozen. For Tip-Adapter-F, the keys are learnable and initialized from the training image features.
  • Figure 5: Evolution of different parameter-efficient adaptation methods of CLIP. The text beside each arrow describes the difference between the source and destination methods.
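
The training-free pipelines of Zero-shot CLIP (Figure 3) and Tip-Adapter (Figure 4) can be sketched roughly as below. This is a hedged illustration based on the commonly published Tip-Adapter formulation, not code from the survey: random vectors stand in for encoder outputs, and `logit_scale`, `alpha` and `beta` are placeholder hyperparameters. Because all features are L2-normalized, the affinity $\exp(-\beta(1 - X^\top F_i))$ equals $\exp(-\tfrac{\beta}{2}\|X - F_i\|^2)$, so one possible reading in the notation of Figure 2 is that the keys play the role of anchors, the affinity vector plays the role of $k(X)$, and the one-hot label matrix plays the role of $V$.

```python
import numpy as np

def l2_normalize(M):
    """Scale each row to unit Euclidean norm."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def zero_shot_logits(X, T, logit_scale=100.0):
    """Zero-shot CLIP: scaled cosine similarity to the class text features."""
    return logit_scale * X @ T.T

def tip_adapter_logits(X, T, keys, onehot, alpha=1.0, beta=5.5, logit_scale=100.0):
    """Zero-shot logits plus a cache term built from few-shot image features."""
    affinity = np.exp(-beta * (1.0 - X @ keys.T))   # (batch, num_keys)
    cache_logits = affinity @ onehot                # (batch, num_classes)
    return zero_shot_logits(X, T, logit_scale) + alpha * cache_logits

# Toy stand-ins (assumed): 3 classes, 2 shots per class, 8-dim features.
rng = np.random.default_rng(0)
num_classes, shots, dim = 3, 2, 8
T = l2_normalize(rng.normal(size=(num_classes, dim)))             # text features
keys = l2_normalize(rng.normal(size=(num_classes * shots, dim)))  # cached image features
onehot = np.eye(num_classes)[np.repeat(np.arange(num_classes), shots)]
X = l2_normalize(rng.normal(size=(4, dim)))                       # query image features

logits = tip_adapter_logits(X, T, keys, onehot)
pred = logits.argmax(axis=1)   # predicted class index per query
```

In Tip-Adapter-F, `keys` would become learnable parameters initialized from the training image features, while the rest of the computation stays the same.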

Theorems & Definitions (2)

  • Definition 1: PEFT of CVLMs for Low-shot Classification
  • Theorem 1: The Representer Theorem