LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation

Hongyun Zhou, Xiangyu Lu, Wang Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang

TL;DR

This work identifies that LoRA's impact on model outputs is uneven across layers and tasks, motivating output-based pruning rather than solely parameter-focused criteria. It introduces LoRA-drop, a two-stage framework that first estimates per-layer importance from LoRA outputs on a stratified sample of the task data, and then prunes by retaining the LoRA of high-importance layers while the remaining layers share a single LoRA. Across NLU benchmarks (GLUE) and generation tasks (E2E, DART, DialogSum, GSM8K) with RoBERTa and Llama2-7b, LoRA-drop matches or surpasses full fine-tuning and standard LoRA while reducing trainable LoRA parameters by roughly 50%–68%. The results indicate that LoRA output magnitude is a reliable indicator of layer importance and that sharing low-importance LoRA components preserves performance, offering a practical, data-adaptive path to more parameter-efficient fine-tuning.
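The sampling step in the first stage can be illustrated with a minimal sketch. The summary does not say which strata are used, so this example assumes the strata are class labels; the function name `stratified_sample`, the proportion argument `alpha`, and the toy label list are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(labels, alpha, seed=0):
    """Return sorted example indices covering roughly `alpha` of each label group.

    Assumption: strata are class labels. The source only says "stratified sampling".
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for indices in by_label.values():
        # Keep at least one example per stratum so no class disappears.
        k = max(1, round(alpha * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy, made-up label distribution: 60 examples of one class, 40 of another.
labels = ["entailment"] * 60 + ["not_entailment"] * 40
subset = stratified_sample(labels, alpha=0.1)
```

Sampling per stratum keeps the class balance of the subset close to that of the full dataset, which is the usual reason to prefer it over uniform sampling when the subset is small.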

Abstract

Low-Rank Adaptation (LoRA) is currently the most commonly used parameter-efficient fine-tuning (PEFT) method. It introduces auxiliary parameters for each layer to fine-tune the pre-trained model under limited computing resources. However, it still faces resource consumption challenges during training when scaling up to larger models. Most previous studies have tackled this issue with pruning techniques, which remove LoRA parameters deemed unimportant. Nonetheless, these efforts evaluate importance only from features of the LoRA parameters themselves, such as parameter count, size, and gradient. In fact, the output of LoRA (the product of the LoRA parameters and the hidden state) directly impacts the final results. Preliminary experiments indicate that a fraction of LoRA elements produce significantly high output values, substantially influencing the layer output. Motivated by this observation, we propose LoRA-drop. Concretely, LoRA-drop evaluates the importance of each LoRA based on its output. We then retain LoRA for the important layers, while the other layers share a single LoRA. We conduct extensive experiments with models of different scales on NLU and NLG tasks. Results demonstrate that LoRA-drop can achieve performance comparable to full fine-tuning and LoRA while retaining 50% of the LoRA parameters on average.
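The idea of scoring layers by their LoRA output and then deciding which layers keep their own LoRA can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes importance is the mean squared norm of the LoRA output B(Ax) normalized to sum to 1 across layers, and that layers are kept in descending order of importance until a cumulative threshold is reached. The helper names, the threshold value, and the toy matrices are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_output_sq_norm(A, B, x):
    """Squared norm of the LoRA output B(Ax); A is r x d_in, B is d_out x r."""
    delta = matvec(B, matvec(A, x))
    return sum(v * v for v in delta)


def layer_importance(layers, inputs):
    """Assumed score: mean squared LoRA output norm per layer, normalized to sum to 1."""
    raw = [
        sum(lora_output_sq_norm(A, B, x) for x in xs) / len(xs)
        for (A, B), xs in zip(layers, inputs)
    ]
    total = sum(raw)
    return [g / total for g in raw]


def select_layers(importance, threshold):
    """Assumed rule: keep top layers until their cumulative importance reaches threshold."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        if cumulative >= threshold:
            break
        kept.append(i)
        cumulative += importance[i]
    return sorted(kept)


# Toy rank-1 LoRA modules for three layers (made-up numbers).
A = [[1.0, 0.0]]
layers = [(A, [[1.0], [0.0]]), (A, [[2.0], [0.0]]), (A, [[1.0], [2.0]])]
inputs = [[[1.0, 0.0]]] * 3  # one sampled hidden state per layer

importance = layer_importance(layers, inputs)
kept = select_layers(importance, threshold=0.8)                # layers with their own LoRA
shared = [i for i in range(len(layers)) if i not in kept]      # layers using one shared LoRA
```

In this toy setup the raw scores are 1, 4, and 5, so the last two layers keep their own LoRA and the first layer falls into the shared group.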

Paper Structure

This paper contains 25 sections, 3 equations, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: The diagram of LoRA. LoRA influences the pre-trained model through its output $\Delta \bm{W}\bm{x}$. This paper's method measures the importance of LoRA based on its output.
  • Figure 2: The frequency distribution of the squared norm of query LoRA output $\Delta \bm{W}_i\bm{x}_i$ on the RTE task. Each subplot represents the distribution of $\| \Delta \bm{W}_i\bm{x}_i \|^2$ for query LoRA from layers 0 to 11, where the x-axis denotes the magnitude of $\| \Delta \bm{W}_i\bm{x}_i \|^2$ for different inputs $\bm{x}_i$, and the y-axis represents the frequency of $\| \Delta \bm{W}_i\bm{x}_i \|^2$.
  • Figure 3: The overall workflow of LoRA-drop.
  • Figure 4: LoRA importance distribution across different downstream task data. To unify the importance scales across datasets, we divide the importance values of each dataset by their maximum, so that the most important LoRA layer in each dataset has importance 1.
  • Figure 5: Importance distribution of LoRA for query in RTE under different sample proportions. Each point on the heatmap represents the importance $I_{i}$ of the query LoRA in layer $i$ at sample proportion $\alpha$.
  • ...and 7 more figures
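The per-dataset scaling described in the Figure 4 caption (dividing each dataset's importance values by their maximum) can be written in a few lines. The function name `normalize_by_max` and the importance values below are made up for illustration and are not results from the paper.

```python
def normalize_by_max(importance_by_dataset):
    """Scale each dataset's per-layer importance so its largest value becomes 1."""
    return {
        name: [value / max(values) for value in values]
        for name, values in importance_by_dataset.items()
    }


# Hypothetical per-layer importance values for two datasets.
raw = {"RTE": [0.02, 0.10, 0.40], "MRPC": [0.30, 0.15, 0.05]}
scaled = normalize_by_max(raw)
```

After this scaling, values from different datasets share a common 0 to 1 range, so they can be compared on a single plot.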