
Sparse Matrix in Large Language Model Fine-tuning

Haoze He, Juncheng Billy Li, Xuan Jiang, Heather Miller

TL;DR

This work tackles the high cost and accuracy gap of fine-tuning large language models (LLMs) by introducing Sparse Matrix Tuning (SMT), a parameter-efficient fine-tuning method that selects and updates only the weight sub-matrices with the most significant gradients. SMT identifies these blocks during a warm-up phase by computing the average absolute gradient within each $l \times l$ block (with $l=256$), then updates a small subset of blocks in Q, K, V while freezing the rest, which substantially reduces backward-pass computation, optimizer state, and activation memory. Across LLaMA-family models on commonsense and arithmetic tasks, SMT consistently surpasses LoRA/DoRA while achieving large memory and speed advantages, including a reported 14.6× speedup over full fine-tuning and memory reductions that enable fine-tuning on consumer-grade GPUs. A key finding is that attention, particularly the V vector, carries the most task-relevant memories, and that SMT's performance does not exhibit the plateau observed in low-rank adapters as the trainable parameter count grows. Overall, SMT provides a practical, scalable route to efficient and competitive fine-tuning of large language models, and its code is open source.
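
The block-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single, already accumulated warm-up gradient matrix stored as nested lists, and the names `block_scores` and `select_blocks` and the parameter `k` are hypothetical. A real implementation would operate on framework tensors, use $l=256$, and rank blocks across all Q, K, V matrices.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Block = Tuple[int, int]  # (block_row, block_col) index of an l x l block


def block_scores(grad: Matrix, l: int) -> Dict[Block, float]:
    """Mean absolute gradient of every l x l block of `grad`."""
    rows, cols = len(grad), len(grad[0])
    assert rows % l == 0 and cols % l == 0, "matrix must tile evenly into l x l blocks"
    scores: Dict[Block, float] = {}
    for bi in range(rows // l):
        for bj in range(cols // l):
            total = sum(
                abs(grad[i][j])
                for i in range(bi * l, (bi + 1) * l)
                for j in range(bj * l, (bj + 1) * l)
            )
            scores[(bi, bj)] = total / (l * l)
    return scores


def select_blocks(grad: Matrix, l: int, k: int) -> List[Block]:
    """Indices of the k blocks with the largest mean absolute gradient."""
    scores = block_scores(grad, l)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy 4 x 4 gradient with l = 2 (assumption: tiny sizes for illustration only).
grad = [
    [0.1, -0.1, 2.0, -3.0],
    [0.0, 0.1, -1.0, 4.0],
    [0.5, 0.5, 0.0, 0.0],
    [-0.5, 0.5, 0.1, 0.0],
]
top = select_blocks(grad, l=2, k=2)
print(top)  # [(0, 1), (1, 0)]
```

In this toy case the upper-right block has the largest mean absolute gradient (2.5), followed by the lower-left block (0.5), so those two would be the ones left trainable.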

Abstract

LoRA and its variants have become popular parameter-efficient fine-tuning (PEFT) methods because they avoid excessive computational costs. However, an accuracy gap often exists between PEFT methods and full fine-tuning (FT), and this gap has yet to be systematically studied. In this work, we introduce a method for selecting sparse sub-matrices that aims to minimize the performance gap between PEFT and FT while also reducing both the computational cost and the memory cost of fine-tuning. Our Sparse Matrix Tuning (SMT) method begins by identifying the most significant sub-matrices in the gradient update and then updates only these blocks during fine-tuning. In our experiments, we demonstrate that SMT consistently surpasses other PEFT baselines (e.g., LoRA and DoRA) in fine-tuning popular large language models such as LLaMA across a broad spectrum of tasks, while reducing the GPU memory footprint by 67% compared to FT. We also examine how the performance of LoRA and DoRA tends to plateau and decline as the number of trainable parameters increases; in contrast, our SMT method does not suffer from this issue.
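
To make "updating only these blocks" concrete, the following sketch applies a gradient step to a given set of selected blocks and leaves every other entry frozen. Several things here are assumptions made for illustration: plain SGD stands in for whatever optimizer is actually used, the matrices are nested lists rather than tensors, and `sparse_sgd_step` and `trainable_fraction` are hypothetical names.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Block = Tuple[int, int]  # (block_row, block_col) index of an l x l block


def sparse_sgd_step(weight: Matrix, grad: Matrix, selected: List[Block],
                    l: int, lr: float) -> None:
    """In-place SGD step restricted to the selected l x l blocks.

    Entries outside the selected blocks are never read or written, which is
    why their gradients and optimizer state need not be kept at all.
    """
    for bi, bj in selected:
        for i in range(bi * l, (bi + 1) * l):
            for j in range(bj * l, (bj + 1) * l):
                weight[i][j] -= lr * grad[i][j]


def trainable_fraction(weight: Matrix, selected: List[Block], l: int) -> float:
    """Share of the matrix entries that belong to selected blocks."""
    return len(selected) * l * l / (len(weight) * len(weight[0]))


# Toy 4 x 4 example with l = 2 and a single trainable block (upper right).
weight = [[0.0] * 4 for _ in range(4)]
grad = [[1.0] * 4 for _ in range(4)]
selected = [(0, 1)]

sparse_sgd_step(weight, grad, selected, l=2, lr=0.5)
fraction = trainable_fraction(weight, selected, l=2)
print(weight[0])  # [0.0, 0.0, -0.5, -0.5]
print(fraction)   # 0.25
```

Only the four entries of the selected block change; the other twelve stay at their pretrained values, so in this toy case 25% of the matrix is trainable.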

Paper Structure (19 sections, 3 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Differences between the low-rank adaptation method LoRA and SMT. The upper picture depicts the adaptation approach in LoRA and the lower picture represents the sub-matrix sparsity approach in SMT.
  • Figure 2: (a) A sparse weight matrix $W$. The green sub-matrices with significant gradients can be updated. (b) Backward propagation calculation of the partial gradient for weight matrix $W$. (c) Computation graph in automatic differentiation systems.
  • Figure 3: Accuracy comparison of LoRA, DoRA, and SMT under different scaling of trainable parameters on commonsense reasoning datasets. For a given base model and PEFT method, we gradually increase the number of trainable parameters on each line from left to right. On each line, the best-performing model is marked with $^*$.
  • Figure 4: A visualization of trainable Q, K, V layers when fine-tuning 0.86% of the parameters of LLaMA-7B. LLaMA-7B has 32 layers, each of which contains a Q vector, a K vector, and a V vector. White layers are frozen and green layers contain trainable parameters.
  • ...and 4 more figures