Visual Tuning

Bruce X. B. Yu; Jianlong Chang; Haixin Wang; Lingbo Liu; Shijie Wang; Zhiyu Wang; Junfan Lin; Lingxi Xie; Haojie Li; Zhouchen Lin; Qi Tian; Chang Wen Chen

TL;DR

This survey analyzes visual tuning as a parameter-efficient transfer learning (PETL) paradigm for large pre-trained visual and visual-language models. It classifies techniques into five families (fine-tuning, prompt tuning, adapter tuning, parameter tuning, and remapping tuning) and explains their mechanisms, strengths, and limitations. The work grounds these methods in three theoretical lenses (biological, model-scale, and statistical) and in formal notation, and it situates them within pre-training strategies and common architectures such as CNNs and Transformers. It then outlines practical future directions, including advanced pre-training data, scalable optimization, interpretable prompts, and diversified interactions, to enable efficient deployment on edge devices and broad-domain tasks. Overall, the paper offers a comprehensive roadmap for using PETL to get the most out of increasingly large vision foundation models while managing memory and compute constraints.

Abstract

Fine-tuning visual models has been widely shown to deliver promising performance on many downstream visual tasks. With the surprising development of pre-trained visual foundation models, visual tuning has moved beyond the standard modus operandi of fine-tuning either the whole pre-trained model or just the fully connected layer. Instead, recent advances can outperform full tuning of all pre-trained parameters while updating far fewer parameters, enabling edge devices and downstream applications to reuse the increasingly large foundation models deployed on the cloud. With the aim of helping researchers get the full picture and future directions of visual tuning, this survey characterizes a large and thoughtful selection of recent works, providing a systematic and comprehensive overview of existing work and models. Specifically, it provides a detailed background of visual tuning and categorizes recent visual tuning techniques into five groups: fine-tuning, prompt tuning, adapter tuning, parameter tuning, and remapping tuning. It also offers some exciting research directions for prospective pre-training and various interactions in visual tuning.
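To make the phrase "updating far fewer parameters" concrete, the following is a minimal sketch that counts tunable versus frozen scalars when a small bottleneck adapter is attached to a frozen two-layer MLP block. The helper names and the sizes (768, 3072, 64) are assumptions chosen for illustration; they are not taken from the survey, and the bottleneck adapter is only one possible design within the adapter-tuning family.

```python
def linear_params(d_in, d_out, bias=True):
    """Number of scalars in one dense layer (weights plus optional bias)."""
    return d_in * d_out + (d_out if bias else 0)


def adapter_params(d_model, bottleneck):
    """Scalars in a bottleneck adapter: down-projection then up-projection."""
    down = linear_params(d_model, bottleneck)
    up = linear_params(bottleneck, d_model)
    return down + up


def tunable_fraction(d_model, d_hidden, bottleneck):
    """Share of tunable scalars when only the adapter is updated.

    The frozen part is a two-layer MLP block (d_model -> d_hidden -> d_model);
    the tuned part is the adapter attached to it.
    """
    frozen = linear_params(d_model, d_hidden) + linear_params(d_hidden, d_model)
    tuned = adapter_params(d_model, bottleneck)
    return tuned / (frozen + tuned)


# Hypothetical sizes for illustration only.
frozen_block = linear_params(768, 3072) + linear_params(3072, 768)
tuned_adapter = adapter_params(768, 64)
fraction = tunable_fraction(768, 3072, 64)
print(f"frozen={frozen_block}, tuned={tuned_adapter}, fraction={fraction:.4f}")
```

Under these assumed sizes the adapter accounts for roughly 2% of the block's parameters, which is the kind of ratio that makes reuse on memory-constrained devices plausible.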

Paper Structure (43 sections, 17 equations, 5 figures, 2 tables)

Introduction
Background
Theories
Biological Perspective
Model Perspective
Statistical Perspective
Notation and Definition
Model Architecture
Model Pre-training
Model Tuning
Visual Tuning
Fine-tuning
Prompt Tuning
Vision-driven Prompt
Language-driven Prompt
...and 28 more sections

Figures (5)

Figure 1: Illustration of visual tuning. A pre-trained foundation model can accumulate knowledge via various pre-training techniques by scaling up in terms of model size, data modalities, tasks, etc. Given the pre-trained model, the focus of this survey is visual tuning: how to effectively reuse the knowledge of pre-trained models while considering important aspects such as tuned parameters, generalization ability, data efficacy, training memory, and inference memory.
Figure 2: Three different types of prompt methods. The red and blue parts are tunable and frozen parameters, respectively.
Figure 3: Three different types of adapter methods. Red and blue parts are tunable and frozen parameters, respectively.
Figure 4: Three types of parameter tuning. Red and blue parts are tunable and frozen parameters, respectively.
Figure 5: Three different types of remapping tuning methods. Red and blue parts are tunable and frozen parameters, respectively.
