PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning

Xin Yu, Cong Xie, Ziyu Zhao, Tiantian Fan, Lingzhou Xue, Zhi Zhang

TL;DR

This work addresses LoRA's limited capacity relative to full fine-tuning by starting from an over-parameterized low-rank space and applying gradient-based structured pruning to arrive at a compact, expressive adapter. It introduces PrunedLoRA, which jointly prunes the LoRA submatrices $A$ and $B$ via a Hessian-informed mask and second-order updates, enabling adaptive rank reduction during fine-tuning. The authors provide a theoretical analysis showing that, in a toy self-attention setting, gradient-based pruning is more robust to weight perturbations than activation-based pruning. Empirically, the approach yields consistent gains on mathematical reasoning and natural language understanding tasks at sparsity levels from $50\%$ to $93\%$. Overall, PrunedLoRA narrows the gap between LoRA and full fine-tuning while maintaining inference efficiency, with stronger performance when initialized with higher ranks and pruned gradually.

Abstract

Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models, yet its representational capacity often lags behind full fine-tuning. Within the context of LoRA, a key open question is how to obtain expressive low-rank adapters from over-parameterized spaces. We propose PrunedLoRA, a new framework that leverages structured pruning to obtain highly representative low-rank adapters from an over-parameterized initialization. Unlike prior approaches that impose a fixed low-rank budget, PrunedLoRA dynamically prunes less important components during fine-tuning and prevents their reactivation, enabling flexible and adaptive rank allocation. For structured pruning, we derive fine-grained pruning and recovery updates by minimizing the pruning error with respect to the overall loss, which yields a gradient-based pruning strategy with a grounded interpretation. We provide the first theoretical analysis of the robustness of structured pruning and provably show that, under weight perturbation, gradient-based pruning is more robust than activation-based pruning with respect to the overall loss. Empirically, PrunedLoRA consistently outperforms LoRA and its variants across supervised fine-tuning tasks in mathematical reasoning, code generation, and natural language understanding, and it also demonstrates advantages over existing structured pruning methods across diverse sparsity levels.
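
To make the idea of pruning rank components with a persistent mask concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: the saliency score is a simple first-order Taylor proxy standing in for the Hessian-informed criterion, no recovery update is applied to the surviving weights, and the function name `prune_lora_ranks`, its arguments, and the shape convention ($B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, adapter update $BA$) are assumptions made for illustration.

```python
import numpy as np


def prune_lora_ranks(A, B, grad_A, grad_B, mask, target_rank):
    """Jointly prune rank components of a LoRA adapter (illustrative sketch).

    Rank component i is the pair (row i of A, column i of B).
    Assumed shapes: A is (r, n), B is (m, r), mask is a boolean vector of length r.
    """
    # First-order Taylor saliency per rank component: |<grad, weight>|.
    # This is a stand-in for the paper's Hessian-informed score.
    score = np.abs((grad_B * B).sum(axis=0) + (grad_A * A).sum(axis=1))

    # Components that were pruned earlier can never be selected again.
    score = np.where(mask, score, -np.inf)

    # Keep the highest-scoring components, up to the target rank.
    keep = np.argsort(score)[::-1][:target_rank]
    new_mask = np.zeros_like(mask)
    new_mask[keep] = True
    new_mask &= mask  # guard against re-enabling pruned components

    # Zero the pruned row of A and the matching column of B together.
    A_pruned = A * new_mask[:, None]
    B_pruned = B * new_mask[None, :]
    return A_pruned, B_pruned, new_mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 6, 5, 8
    A = rng.normal(size=(r, n))
    B = rng.normal(size=(m, r))
    gA = rng.normal(size=(r, n))  # placeholder gradients
    gB = rng.normal(size=(m, r))
    mask = np.ones(r, dtype=bool)

    A, B, mask = prune_lora_ranks(A, B, gA, gB, mask, target_rank=4)
    print("active rank components:", int(mask.sum()))
```

In a gradual schedule, a function like this would be called repeatedly with a shrinking `target_rank`, carrying `mask` forward between calls so that pruned components stay pruned.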

Paper Structure

This paper contains 21 sections, 2 theorems, 26 equations, 4 figures, 8 tables, 2 algorithms.

Key Result

Proposition 1

Suppose that, under activation-based and gradient-based pruning strategies, each module in a single attention module satisfies a given perturbation error. The error in the loss function would be linear w.r.t. perturbation error under different pruning strategies, but the error of activation-based me

Figures (4)

  • Figure 1: Performance of standard LoRA (Hu et al., 2022) on GSM8K (Cobbe et al., 2021) at different ranks, compared to full fine-tuning. Full fine-tuning has no initial rank; its red line is drawn solely for comparison.
  • Figure 2: Left: schematic of the dynamic pruning process, where the gradient and the estimated Hessian determine which columns are pruned and how the remaining weights are updated, as shown in Algorithm \ref{alg:gsp-obs}. Right: design of PrunedLoRA, where both adapter matrices ${\bm{A}}$ and ${\bm{B}}$ are jointly pruned under a masking scheme.
  • Figure 3: GSM8K accuracy of different pruning methods (SparseGPT, LLM-Pruner, and PrunedLoRA) under various initialization ranks $r \in \{64, 128, 256, 512\}$ and target ranks $\{8, 16, 32\}$. Each subfigure reports performance when starting from a specific initialization rank.
  • Figure : Comparison of trainable parameter ratios (before and after pruning) and training time across different fine-tuning methods.

Theorems & Definitions (3)

  • Proposition 1: Unofficial Statement
  • Definition 1: $\varepsilon$-Perturbation Error
  • Proposition 2