Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

Ting Liu; Xuyang Liu; Liangtao Shi; Zunnan Xu; Yue Hu; Siteng Huang; Yi Xin; Bineng Zhong; Donglin Wang

TL;DR

Sparse-Tuning addresses the dual challenge of fine-tuning and running large Vision Transformers efficiently by coupling token sparsification with Dense Adapters that propagate shallow-layer information to deeper layers. This design mitigates the information loss caused by discarding tokens and preserves accuracy while reducing computation and memory usage (GFLOPs fall to 66% of the original ViT-B), achieving state-of-the-art performance on VTAB-1K and strong results on complete image and video datasets. The method scales to larger ViT models and generalizes to segmentation tasks, which suggests it is practical for deploying large ViTs in resource-constrained settings. Overall, the work points toward jointly optimizing training and inference efficiency for vision transformers.

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a popular solution for adapting pre-trained Vision Transformer (ViT) models to downstream applications by updating only a small subset of parameters. While current PEFT methods have achieved fine-tuning efficiency, they overlook the efficiency of computation and GPU memory during inference, falling short of practical requirements. To address this limitation, we propose Sparse-Tuning, an efficient and effective framework that leverages popular token sparsification (TS) techniques to reduce information redundancy in images and videos, thereby significantly improving computational and memory efficiency. However, TS often compromises performance due to inevitable information loss. We therefore introduce Dense Adapters (DA) to compensate for the information loss incurred by token sparsification. DA integrates comprehensive token information from shallow layers into the retained tokens of deeper layers, ensuring minimal performance degradation. Through the integration of TS techniques and DA, Sparse-Tuning achieves a significant reduction in computation and memory overhead while maintaining performance. Empirical results on VTAB-1K, three image datasets, and two video datasets show that Sparse-Tuning reduces GFLOPs to 66% of the original ViT-B while achieving state-of-the-art performance compared to full fine-tuning and other PEFT baselines.
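
The two ideas in the abstract, dropping low-importance tokens and then compensating with shallow-layer information, can be illustrated with a toy sketch. It is not the paper's implementation. The function names `sparsify_tokens` and `fuse_shallow_summary`, the per-token `scores`, the `keep_ratio` parameter, and the mean-pooled fusion with a fixed `weight` are all assumptions made for illustration. The real Dense Adapters are learned modules whose design is described in the paper's methodology section.

```python
# Toy illustration only: plain Python lists stand in for token tensors.

def sparsify_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by score, take the top ones, then restore positional order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:num_keep]
    keep = sorted(top)
    return [tokens[i] for i in keep], keep


def fuse_shallow_summary(kept_tokens, shallow_tokens, weight=0.5):
    """Add a mean-pooled summary of all shallow-layer tokens to each kept token.

    Hypothetical stand-in for the compensation role of a Dense Adapter:
    information from every shallow token reaches the tokens that survive.
    """
    dim = len(shallow_tokens[0])
    count = len(shallow_tokens)
    summary = [sum(tok[d] for tok in shallow_tokens) / count for d in range(dim)]
    return [[x + weight * s for x, s in zip(tok, summary)] for tok in kept_tokens]


# Four 2-dimensional tokens with made-up importance scores.
shallow = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
scores = [0.1, 0.4, 0.3, 0.2]

kept, kept_idx = sparsify_tokens(shallow, scores, keep_ratio=0.5)
fused = fuse_shallow_summary(kept, shallow)
print(kept_idx)  # positions of the tokens that were retained
print(fused)     # retained tokens after adding the shallow-layer summary
```

With half of the tokens removed, later layers process a shorter sequence, which is where the compute and memory savings come from. The fusion step shows, in the simplest possible form, how discarded tokens can still influence the retained ones.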

Paper Structure (37 sections, 17 equations, 4 figures, 14 tables, 1 algorithm)

This paper contains 37 sections, 17 equations, 4 figures, 14 tables, 1 algorithm.

Introduction
Related Work
Token Sparsification for ViT
Parameter-efficient Fine-tuning
Methodology
Preliminaries
Vision Transformers
Adapter Tuning
Prominent Token Sparsification Methods
Combining PEFT with Token Sparsification
Sparse-Tuning for Efficient ViT Adaption
Token Sparsification with Dense Adapters
Dense Adapter Variants
Experiments
Experimental Setup
...and 22 more sections

Figures (4)

Figure 1: Comparisons of Sparse-Tuning with other mainstream PEFT methods on CIFAR-100 dataset. Sparse-Tuning enhances performance while remarkably reducing training and inference time, GPU memory consumption, and computational complexity, achieving both fine-tuning and inference efficiency of the pre-trained ViT.
Figure 2: Overall framework. We freeze the pre-trained ViT-B/16 and update the proposed Dense Adapters (DAs) to efficiently fine-tune the pre-trained ViT. By selectively adapting tokens to focus on informative regions, Sparse-Tuning significantly reduces the computational cost of redundant tokens, thereby enhancing efficiency during both fine-tuning and inference stages. Token sparsification is the process of removing redundant tokens.
Figure 3: Variants of Dense Adapters. We present two variants that integrate multi-level features from different encoder layers at (a) the input stage and (b) the output stage.
Figure 4: Comparison of attention maps between our method and other PEFT+TS combinations.
