Gradient-based Parameter Selection for Efficient Fine-Tuning

Zhi Zhang, Qizhe Zhang, Zijun Gao, Renrui Zhang, Ekaterina Shutova, Shiji Zhou, Shanghang Zhang

TL;DR

This work addresses the practical challenge of fine-tuning large pre-trained models by proposing Gradient-based Parameter Selection (GPS), a parameter-efficient method that tunes a tiny, task-specific subnetwork without adding any new parameters. GPS uses a gradient-based criterion to select top-K input connections per neuron, distributing updates across the entire network, and applies masked fine-tuning to update only the selected weights. The method is grounded in a sparse-regularization view and leverages a head-free supervised contrastive loss for gradient computation, yielding robust, architecture-agnostic performance gains on image classification benchmarks (FGVC, VTAB) and semantic segmentation tasks while reducing training cost. Overall, GPS establishes a practical, scalable approach for efficient transfer learning that maintains model integrity and achieves state-of-the-art results among PEFT methods.
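The selection and masked-update steps described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes the selection criterion is the absolute gradient magnitude, uses hand-written toy gradients in place of gradients from the supervised contrastive loss, and applies a plain SGD step. The function names, `k`, and `lr` are hypothetical.

```python
# Illustrative sketch only: the function names, the toy layer, k, and lr are
# assumptions for demonstration, not the authors' code.

def select_top_k_mask(grad, k):
    """For each neuron (one row of grad), mark the k input connections with
    the largest absolute gradient. Returns a 0/1 mask with grad's shape."""
    mask = []
    for row in grad:
        # Input indices of this neuron, ranked by gradient magnitude (largest first).
        ranked = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        chosen = set(ranked[:k])
        mask.append([1 if j in chosen else 0 for j in range(len(row))])
    return mask


def masked_sgd_step(weights, grad, mask, lr):
    """Plain SGD update applied only where mask == 1; other weights stay frozen."""
    return [
        [w - lr * g * m for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weights, grad, mask)
    ]


# Toy layer: 2 neurons, 3 input connections each (values are made up).
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.6]]
grad = [[0.01, -0.90, 0.20], [0.40, 0.05, -0.30]]

mask = select_top_k_mask(grad, k=1)
updated = masked_sgd_step(weights, grad, mask, lr=0.1)
print(mask)
print(updated)
```

Because the top-K choice is made per neuron rather than over the whole network, every neuron receives at least one trainable connection, which is how the selected weights end up spread across all layers.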

Abstract

With the growing size of pre-trained models, fully fine-tuning and storing all the parameters for each downstream task is costly and often infeasible. In this paper, we propose a new parameter-efficient fine-tuning method, Gradient-based Parameter Selection (GPS), and demonstrate that tuning only a few selected parameters of the pre-trained model while keeping the remainder frozen can match or exceed the performance of full fine-tuning. Unlike existing popular and state-of-the-art parameter-efficient fine-tuning approaches, our method introduces no additional parameters or computational cost during either training or inference. It is also model-agnostic and non-destructive, which eliminates the need for any architecture-specific design. Compared with full fine-tuning, GPS achieves accuracy improvements of 3.33% (91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) while tuning only 0.36% of the pre-trained model's parameters on average over 24 image classification tasks; it also demonstrates significant improvements of 17% and 16.8% in mDice and mIoU, respectively, on a medical image segmentation task. Moreover, GPS achieves state-of-the-art performance compared with existing PEFT methods.

Paper Structure

This paper contains 71 sections, 9 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Comparison between our GPS and other PEFT methods. (a) Existing popular methods introduce extra parameters for tuning on downstream tasks, which may require a special design for each architecture, such as appending prompts to the input tokens in a Transformer or inserting different modules into different layers. (b) Our approach avoids introducing additional parameters and fine-tunes only parameters selected from the model itself, employing a unified gradient-based parameter selection method across diverse architectures, e.g. Transformer and CNN.
  • Figure 2: Performance comparisons of 11 fine-tuning methods with a pre-trained ViT-B/16 model on the VTAB-1k (a) and FGVC (b) benchmarks. Our GPS (red stars) achieves state-of-the-art performance on both benchmarks with only 0.25% and 0.77% average trainable parameters respectively.
  • Figure 3: The overall pipeline of GPS. We first select a small portion of important parameters (a sub-network) for each task from the original pre-trained model using a gradient-based method. We then fine-tune only this sub-network while keeping the other parameters frozen.
  • Figure 4: Computational cost of different tuning methods. From left to right: training time, training memory, test time, and test memory. Training/Test time is the time consumed by a mini-batch.
  • Figure 5: Visualization of the polyp segmentation task. Compared with other methods, our GPS can still handle difficult segmentation cases.
  • ...and 8 more figures