Expanding Sparse Tuning for Low Memory Usage

Shufan Shen, Junshu Sun, Xiangyang Ji, Qingming Huang, Shuhui Wang

TL;DR

SNELL (Sparse tuning with kerNELized LoRA) is a method for sparse tuning with low memory usage. It decomposes the tunable matrix to be sparsified into two learnable low-rank matrices, avoiding the costly storage of the whole original matrix.

Abstract

Parameter-efficient fine-tuning (PEFT) is an effective method for adapting pre-trained vision models to downstream tasks by tuning a small subset of parameters. Among PEFT methods, sparse tuning achieves superior performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the whole weight matrix. However, this performance improvement comes with increased memory usage, which stems from two factors: the storage of the whole weight matrix as learnable parameters in the optimizer, and the additional storage of tunable weight indexes. In this paper, we propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage. To achieve low memory usage, SNELL decomposes the tunable matrix to be sparsified into two learnable low-rank matrices, avoiding the costly storage of the whole original matrix. A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes. To maintain the effectiveness of sparse tuning with low-rank matrices, we extend the low-rank decomposition by applying nonlinear kernel functions to the whole-matrix merging. Consequently, the rank of the merged matrix increases, enhancing the ability of SNELL to adapt pre-trained models to downstream tasks. Extensive experiments on multiple downstream tasks show that SNELL achieves state-of-the-art performance with low memory usage, enabling PEFT with sparse tuning on large-scale models. Code is available at https://github.com/ssfgunner/SNELL.
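
As a rough illustration of the two ideas described in the abstract, the sketch below merges two low-rank matrices with a nonlinear kernel instead of a plain dot product, then zeros out the smallest-magnitude entries of the merged matrix. It is not the authors' implementation: the RBF kernel, the hard-threshold rule, the function names (`rbf_kernel`, `kernelized_merge`, `sparsify`), and the toy shapes are all assumptions chosen for illustration.

```python
import math
import random


def rbf_kernel(u, v, gamma=0.5):
    """Assumed stand-in kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernelized_merge(B, A, kernel=rbf_kernel):
    """Merge B (d_out x r) and A (d_in x r) into a d_out x d_in matrix.

    Plain LoRA would take the dot product of row i of B with row j of A;
    here a nonlinear kernel replaces that dot product.
    """
    return [[kernel(b_row, a_row) for a_row in A] for b_row in B]


def sparsify(delta_w, s):
    """Zero out (at least) the fraction s of entries with smallest |value|.

    The threshold is recomputed from the values themselves, so no index
    list or binary mask has to be stored between steps.
    """
    flat = sorted(abs(x) for row in delta_w for x in row)
    k = int(s * len(flat))
    if k == 0:
        return [row[:] for row in delta_w]
    threshold = flat[k - 1]
    return [[x if abs(x) > threshold else 0.0 for x in row] for row in delta_w]


# Toy usage with hypothetical sizes: a 6 x 5 update built from rank-2 factors.
random.seed(0)
d_out, d_in, r = 6, 5, 2
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_in)]

delta_w = kernelized_merge(B, A)
sparse_delta_w = sparsify(delta_w, s=0.8)
kept = sum(1 for row in sparse_delta_w for x in row if x != 0.0)
print(f"kept {kept} of {d_out * d_in} entries")
```

Only `B` and `A` would be learnable in this sketch; the merged matrix and its sparsified version are recomputed from them, which is the property the abstract attributes to SNELL.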

Paper Structure

This paper contains 31 sections, 8 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: (a) The high memory usage of sparse tuning arises from taking the whole weight matrix as learnable parameters, in addition to the storage of the tunable weight indexes (typically represented as a binary mask). (b) Our framework (SNELL) only stores the learnable low-rank matrices in the optimizer. (c) Memory usage comparison on pre-trained models with different depths.
  • Figure 2: Overview of our SNELL strategy. Given two learnable low-rank matrices, we merge them using a nonlinear kernel function (left). This merging process is equivalent to mapping the matrices to higher-rank matrices and then performing matrix multiplication. We then sparsify the merged adaptation matrix using a competition-based sparsification mechanism (right). This mechanism zeros out the weights with the smallest absolute values according to the specified percentage $s$.
  • Figure 3: (a) Accuracy vs. memory usage (batch size 64) with supervised pre-trained ViT-B/16 on VTAB-1k. (b) Evolution of memory usage during fine-tuning on ViT-H/14 (batch size 8) for full fine-tuning, SNELL, and SNELL storing the merged adaptation matrix (SNELL storing $\Delta \mathbf{W}$). (c) Model parameter volume vs. memory usage (batch size 8). As the model gets larger, SNELL's memory advantage over full fine-tuning becomes more pronounced.
  • Figure 4: (a) The fitting ability of different kernel functions. We fit random sparse matrices by merging two learnable low-rank matrices with different kernel functions and compute the MSE loss. (b) Performance comparison on groups of datasets in VTAB-1k. (c) Training loss of kernelized LoRA with different kernel functions on the CIFAR-100 dataset in the VTAB-1k benchmark.
  • Figure 5: The optimal sparsity ratio of SNELL-8 on different downstream tasks (left) and the average optimal sparsity ratio within each group (right) in the VTAB-1k benchmark. The pre-trained model is ConvNeXt-B pre-trained on ImageNet-21k.
  • ...and 3 more figures
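
To make the contrast in the Figure 1 caption concrete, the short sketch below counts entries for a single weight matrix under the two storage schemes: the whole matrix plus a binary mask, versus two low-rank factors. The 768 x 768 shape (the hidden width of ViT-B) and the rank of 8 are assumed values for illustration, and the function name `learnable_entry_counts` is hypothetical. The numbers are entry counts only, not measured GPU memory.

```python
def learnable_entry_counts(d_out, d_in, r):
    """Return (full_matrix, mask, low_rank) entry counts for one layer."""
    full_matrix = d_out * d_in      # whole weight matrix held as learnable
    mask = d_out * d_in             # one binary mask entry per weight
    low_rank = r * (d_out + d_in)   # two factors: d_out x r and d_in x r
    return full_matrix, mask, low_rank


# Assumed example: one 768 x 768 projection with rank-8 factors.
full_matrix, mask, low_rank = learnable_entry_counts(768, 768, 8)
print(full_matrix, mask, low_rank)   # 589824 589824 12288
print(full_matrix / low_rank)        # 48.0
```

Optimizers that keep per-parameter state scale that state with the learnable entry count, so the gap between the two schemes applies to the optimizer state as well.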