SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers

Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets

TL;DR

This work proposes SparseGrad, a new selective PEFT method that performs well on MLP blocks. It transfers layer gradients to a space where only about 1% of the layer's elements remain significant, which reduces the number of updated parameters.

Abstract

The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose SparseGrad, a new selective PEFT method that performs well on MLP blocks. We transfer layer gradients to a space where only about 1% of the layer's elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp, two popular and robust state-of-the-art PEFT approaches.

Paper Structure

This paper contains 18 sections, 2 equations, 3 figures, 15 tables.

Figures (3)

  • Figure 1: The first row illustrates signal propagation in the original Linear Layer, while the second row illustrates propagation with the proposed SparseGradLinear layer.
  • Figure 2: Gradients on the 5th BERT MLP: $U \frac{\partial{L}}{\partial{W^T}} V^T$ (right) is sparser than the original $\frac{\partial{L}}{\partial{W^T}}$ (left).
  • Figure 3: Strided structure of $\frac{\partial{L}}{\partial{\tilde{Y}}}$ (left) and the percentage of nonzero elements in $\frac{\partial{L}}{\partial{\tilde{Y}}}$ throughout training (right).
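
The transformation named in the Figure 2 caption, $U \frac{\partial{L}}{\partial{W^T}} V^T$, followed by keeping only about 1% of the entries, can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation. This page does not say how $U$ and $V$ are obtained, so the sketch assumes they are orthogonal matrices taken from the SVD of a synthetic, nearly low-rank reference gradient. The helper names `top_k_mask` and `sparsify_gradient` and all sizes are hypothetical.

```python
import numpy as np


def top_k_mask(matrix, keep_fraction):
    """Boolean mask that selects the largest-magnitude entries of `matrix`."""
    k = max(1, int(round(keep_fraction * matrix.size)))
    magnitudes = np.abs(matrix).ravel()
    # Value of the k-th largest magnitude; entries at or above it are kept.
    threshold = np.partition(magnitudes, -k)[-k]
    return np.abs(matrix) >= threshold


def sparsify_gradient(grad, U, V, keep_fraction=0.01):
    """Map `grad` to U @ grad @ V.T and zero all but the largest entries."""
    transformed = U @ grad @ V.T
    mask = top_k_mask(transformed, keep_fraction)
    return transformed * mask, mask


rng = np.random.default_rng(0)
n_out, n_in, rank = 64, 48, 2

# Synthetic stand-in for a layer gradient: low-rank signal plus small noise.
reference = rng.normal(size=(n_out, rank)) @ rng.normal(size=(rank, n_in))
grad = reference + 1e-3 * rng.normal(size=(n_out, n_in))

# Assumption: U and V come from the SVD of the reference gradient,
# so that U @ reference @ V.T is (numerically) diagonal.
P, _, Qt = np.linalg.svd(reference)
U, V = P.T, Qt

sparse_grad, mask = sparsify_gradient(grad, U, V, keep_fraction=0.01)

# Fraction of entries kept, and fraction of squared gradient norm retained.
kept = mask.mean()
retained = np.sum(sparse_grad ** 2) / np.sum(grad ** 2)
print(f"kept {kept:.2%} of entries, retained {retained:.4%} of the energy")
```

Because $U$ and $V$ are orthogonal in this sketch, the transform preserves the Frobenius norm of the gradient, so the retained-energy ratio is directly comparable to the original gradient. On this synthetic low-rank example nearly all of the energy sits in the few kept entries; real gradients need not behave this way.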