Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves

Shihan Wu, Ji Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen

TL;DR

This work interrogates the common practice of freezing pre-trained vision-language models during prompt tuning and shows that doing so does not meaningfully improve efficiency or transferability. By analyzing feature-gradient propagation flows, the authors introduce Skip Tuning, which combines Layer-wise Skipping and Class-wise Skipping to reduce both the length and width of gradient paths during full fine-tuning without adding prompts or adapters. Across a broad suite of benchmarks, Skip Tuning delivers superior effectiveness and markedly better memory and time efficiency compared to both prompt-tuning and adapter-based methods. The approach demonstrates robust performance under base-to-new, cross-dataset, domain generalization, and few-shot settings, offering a practical avenue for efficient adaptation of large vision-language models to diverse tasks.

Abstract

Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: https://github.com/Koorye/SkipTuning.

Paper Structure

This paper contains 16 sections, 7 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Comparison of our devised Skip Tuning with state-of-the-art prompt tuning methods in terms of training time (seconds), memory cost (M), and classification accuracy (%) across base-to-new generalization, cross-dataset generalization, domain generalization, and few-shot learning benchmarks. $\times$ indicates the performance improvement over the state-of-the-art. Comparison results with the adapter-based methods are reported in Table \ref{tab:adapter}.
  • Figure 2: Motivations. (a) Comparison between the prompt tuning (PT) method CoOp \cite{zhou2022coop} and the full fine-tuning (FT) baseline in terms of i) the number of learnable parameters, ii) memory usage, iii) time cost, and iv) base-to-new generalization performance. (b) Feature Sensitivity (FS) of CLIP's network layers, averaged over 100 randomly-sampled training images. (c) Gradient Dependence (GD) of class tokens for different training images.
  • Figure 3: Overview of our proposed Skip Tuning. Skip Tuning performs Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) to enhance the memory and time efficiency of the FT baseline. Specifically, LSkip reduces the length of feature-gradient propagation flows (FGPFs) by caching intermediate features produced by the $\omega$-th layers of CLIP's vision encoder $E_V$ and text encoder $E_T$ before FT begins. In contrast, CSkip reduces the width of FGPFs by filtering out unimportant class tokens in the text encoder $E_T$ for every training image.
  • Figure 4: Ablation studies of the number of skipped layers $\omega$ in LSkip, and the sampling rate $r$, the decay coefficient $\lambda$ in CSkip.
  • Figure 5: Few-shot learning results on 11 datasets. For detailed results, please see Sup. Mat. (D).
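
The two mechanisms described in the Figure 3 caption can be illustrated with a small toy sketch in Python. This is not the authors' implementation. The function names (`build_cache`, `forward_from_cache`, `select_classes`), the use of plain callables as stand-ins for encoder layers, and the rule of ranking classes by a per-image score while always keeping the ground-truth class are all assumptions made for illustration. The paper's actual importance criterion and the role of the decay coefficient $\lambda$ are not reproduced here.

```python
import math


def run_layers(layers, x, start=0, stop=None):
    """Apply layers[start:stop] to x, in order."""
    for layer in layers[start:stop]:
        x = layer(x)
    return x


def build_cache(layers, inputs, omega):
    """LSkip-style precompute (hypothetical helper): run the first `omega`
    layers once per input, before fine-tuning, and store the result."""
    return {key: run_layers(layers, x, 0, omega) for key, x in inputs.items()}


def forward_from_cache(layers, cache, key, omega):
    """During fine-tuning, start from the cached feature and run only the
    remaining layers, so the propagation path is shorter."""
    return run_layers(layers, cache[key], omega)


def select_classes(scores, label, rate):
    """CSkip-style filter (assumed criterion): keep the ground-truth class
    plus the highest-scoring other classes, ceil(rate * C) classes in total,
    so fewer class tokens pass through the text encoder for this image."""
    keep = max(1, math.ceil(rate * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    chosen = [label] + [c for c in ranked if c != label]
    return sorted(chosen[:keep])


# Toy "encoder" with three layers; omega = 2 layers are skipped via the cache.
toy_layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
toy_cache = build_cache(toy_layers, {"img0": 5}, omega=2)
skipped_output = forward_from_cache(toy_layers, toy_cache, "img0", omega=2)
full_output = run_layers(toy_layers, 5)

# Four classes, ground truth is class 2, sampling rate r = 0.5.
kept_classes = select_classes([0.1, 0.9, 0.3, 0.7], label=2, rate=0.5)
```

In this toy, `skipped_output` equals `full_output` because the cached layers are deterministic; the saving is that the first `omega` layers are neither re-executed nor back-propagated through at every training step. In a real model the cached layers would be kept fixed for the cache to stay valid.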