Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey and Benchmark

Yi Xin; Jianjiang Yang; Siqi Luo; Yuntao Du; Qi Qin; Kangrui Cen; Yangfan He; Zhiwei Zhang; Bin Fu; Xiaokang Yang; Guangtao Zhai; Ming-Hsuan Yang; Xiaohong Liu

TL;DR

The paper addresses the impracticality of full fine-tuning for large vision foundation models and surveys parameter-efficient fine-tuning (PEFT) techniques across vision tasks. It groups PEFT methods into four categories (addition-based, partial-based, unified-based, and multi-task tuning) and introduces the V-PEFT Bench, together with a PPT metric, to standardize evaluation. The work covers a broad range of tasks and datasets (image, video, dense prediction) and reports benchmark results showing that PEFT generally offers a favorable efficiency-performance trade-off, with the size of the gain depending on domain and data scale. It closes by outlining future directions in explainability, PEFT for generative models, hyperparameter simplification, and privacy, with the aim of encouraging practical adoption and further research.

Abstract

Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks, showcasing exceptional performance. However, as these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands. To address these challenges, parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters. This paper presents a comprehensive survey of the latest advancements in the visual PEFT field, systematically reviewing current methodologies and categorizing them into four primary categories: addition-based, partial-based, unified-based, and multi-task tuning. In addition, this paper offers an in-depth analysis of widely used visual datasets and real-world applications where PEFT methods have been successfully applied. Furthermore, this paper introduces the V-PEFT Bench, a unified benchmark designed to standardize the evaluation of PEFT methods across a diverse set of vision tasks, ensuring consistency and fairness in comparison. Finally, the paper outlines potential directions for future research to propel advances in the PEFT field. A comprehensive collection of resources is available at https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning.

Paper Structure (49 sections, 13 equations, 7 figures, 6 tables)

Introduction
Preliminaries
Problem Definition
Vision Transformer
Diffusion Model
Pre-training
Methodology
Addition-based Tuning Methods
Adapter Tuning
Prompt Tuning
Prefix Tuning
Side Tuning
Partial-based Tuning Methods
Specification Tuning
Reparameterization Tuning
...and 34 more sections

Figures (7)

Figure 1: Representative Vision Foundation Models and Pre-Training Methods. Our analysis primarily focuses on the significant advancements made between 2020 and 2024. Notably, models highlighted in orange represent diffusion models.
Figure 2: Taxonomy of Parameter-Efficient Fine-Tuning Methods for Pre-Trained Vision Models. Existing PEFT methods can be divided into 4 primary categories: Addition-based Tuning, which involves the integration of additional trainable neural modules or parameters into the PVMs; Partial-based Tuning, which focuses on selectively fine-tuning specific parameters within the original PVMs; Unified-based Tuning, which seeks to consolidate various PEFT approaches or incorporate other techniques; Multi-task Tuning, which emphasizes the synergy and complementary relationships between multiple tasks.
Figure 3: Detailed Architecture of PEFT Methods. It covers the key components of different tuning strategies, including Adapter Tuning, Prompt Tuning, Prefix Tuning, Side Tuning, and Reparameterization Tuning.
Figure 4: Comparison Between Single-Task Tuning and Multi-Task Tuning. (a) In single-task tuning, each task has its own specific parameters, creating isolated, parallel execution paths. This approach ensures task independence but shares no knowledge across tasks. (b) In multi-task tuning, tasks keep their task-specific parameters and also draw on a set of shared parameters. This design facilitates the extraction of both task-shared and task-specific knowledge.
Figure 5: Overview of the V-PEFT Bench Codebase Architecture. The codebase comprises four modular layers: Core, which includes datasets, data loaders, model architectures (e.g., ViT, Swin), and training hooks; Algorithm, covering base modules, PETL methods (e.g., Adapter, LoRA, Prefix-Tuning), and utilities such as loss functions and hooks; Extension, offering tools for attention map visualization and feature distribution analysis; and API, which supports configuration, training, evaluation, and scripting. This architecture enables flexible experimentation and seamless integration of PEFT techniques.
...and 2 more figures
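
To make the "parameter-efficient" claim concrete, the sketch below counts trainable parameters for one weight matrix under full fine-tuning and under LoRA, one of the methods named in the Figure 5 caption. It relies on the standard LoRA formulation, in which a frozen weight W of shape d_out x d_in receives a trainable low-rank update B·A with B of shape d_out x r and A of shape r x d_in. The hidden size of 768 (typical of ViT-B) and the rank of 8 are illustrative assumptions, not settings taken from this paper, and the helper names are hypothetical.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Weights updated when a d_out x d_in matrix is fully fine-tuned (bias ignored)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative assumptions: one square projection in a ViT-B-sized block.
hidden = 768
rank = 8

full = dense_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
ratio = lora / full  # fraction of this matrix's weights that LoRA trains

print(f"full fine-tuning: {full} trainable weights")
print(f"LoRA (rank {rank}): {lora} trainable weights")
print(f"trainable fraction: {ratio:.4f}")
```

Under these assumptions the low-rank factors hold 1/48 of the weights in the matrix they adapt. The fraction shrinks further as the hidden size grows for a fixed rank, which is one reason reparameterization methods become more attractive at larger model scales.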
