From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications

Ajay Jaiswal, Yifan Wang, Lu Yin, Shiwei Liu, Runjin Chen, Jiawei Zhao, Ananth Grama, Yuandong Tian, Zhangyang Wang

TL;DR

This work investigates why LLM weight matrices exhibit non-uniform low-rank structure by framing weight evolution through gradient-subspace stabilization detected via Hessian eigenspace analysis. It establishes a theoretical link between gradient dynamics and Hessian structure, identifying layer- and component-specific Hessian gaps and correlating them with low-rank emergence. The authors introduce WeLore, a data-agnostic, layer-adaptive framework that unifies low-rank compression (WeLore-COMP) and parameter-efficient finetuning (WeLore-PEFT) by differentiating Low-rank Components (LRCs) from Non-Low-rank Components (N-LRCs). Empirical results across LLaMa-2 and Mistral models demonstrate substantial memory and compute savings with minimal to no loss in performance, often surpassing full-finetuning or LoRA-style baselines on downstream tasks. The work advances practical green-AI strategies for efficient training and deployment of large transformers by leveraging non-uniform low-rank structures grounded in gradient-Hessian dynamics.

Abstract

The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank form, with the potential to relax memory and compute requirements. Unlike prior efforts that focus on developing novel matrix decompositions, in this work we study the non-uniform low-rank properties of weight matrices in LLMs through the lens of gradient subspace stabilization. First, we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. Second, we empirically establish an important relationship between gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating variable rank reduction across them to minimize the performance drop due to compression. Drawing on this result, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a data-agnostic and one-shot manner. When used as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) and suitably encodes them for minimum performance loss. Our gradient dynamics perspective illustrates that LRCs tend to have better fine-tuning capabilities, and that fine-tuning them alone can closely mimic and sometimes outperform the training loss trajectory and performance of full fine-tuning, with a notable reduction in memory and compute footprint. Code is available at https://github.com/VITA-Group/WeLore.
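
To make the LRC / N-LRC split concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a rank is chosen by thresholding normalized singular values and that a matrix counts as an LRC only when its rank-r factorization stores fewer parameters than the dense matrix. The function name `classify_and_compress`, the threshold `tau`, and the storage criterion are all hypothetical choices made for illustration.

```python
import numpy as np

def classify_and_compress(W, tau=0.1):
    """Return ("LRC", (A, B)) with W ~= A @ B, or ("N-LRC", (W,)).

    Assumption (not from the paper): rank r is the number of singular
    values whose ratio to the largest one exceeds tau, and a matrix is
    treated as low-rank only if r * (m + n) < m * n.
    """
    m, n = W.shape
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = max(1, int(np.sum(s / s[0] > tau)))
    if r * (m + n) < m * n:
        A = U[:, :r] * s[:r]   # fold singular values into the left factor
        B = Vt[:r]
        return "LRC", (A, B)
    return "N-LRC", (W,)       # keep dense: factorizing would not save memory

# Toy inputs: an exactly rank-4 matrix and a dense Gaussian matrix.
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((64, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((64, 4)))
W_low = Q1 @ np.diag([10.0, 8.0, 6.0, 4.0]) @ Q2.T
W_dense = rng.standard_normal((64, 64))

kind_low, factors = classify_and_compress(W_low)
kind_dense, _ = classify_and_compress(W_dense)
```

In this toy setting the rank-4 matrix is stored as two thin factors, while the Gaussian matrix stays dense because most of its normalized singular values exceed the threshold. A per-matrix rank of this kind is one simple way to obtain the variable rank reduction the abstract calls for.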

Paper Structure

This paper contains 37 sections, 2 theorems, 33 equations, 8 figures, 7 tables, 1 algorithm.

Key Result

Theorem 2.1

Let $H_t=\nabla^2 L(W_t)$ be the Hessian of the loss $L(W)$ at time $t$. Under standard assumptions, the eigenvalues and eigenspaces of $H_t$ stabilize as $t \rightarrow \infty$. Specifically: (1) …
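
As a toy illustration of what eigenspace stabilization can look like (not the setting or proof of the theorem), the sketch below runs gradient descent on a small L2-regularized logistic regression and tracks how much the top-k eigenspace of the Hessian changes between consecutive steps. The problem sizes, feature scales, learning rate, and the overlap measure $\|U_t^\top U_{t+1}\|_F^2 / k$ are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hessian(X, w, lam):
    """Hessian of the mean logistic loss plus (lam / 2) * ||w||^2 at w."""
    p = sigmoid(X @ w)
    s = p * (1.0 - p)
    return (X * s[:, None]).T @ X / X.shape[0] + lam * np.eye(X.shape[1])

def top_eigenspace(H, k):
    """Orthonormal basis of the k leading eigenvectors of a symmetric H."""
    _, vecs = np.linalg.eigh(H)       # eigenvalues come back in ascending order
    return vecs[:, -k:]

def subspace_overlap(U, V):
    """Mean squared cosine of the principal angles; 1 means identical spans."""
    return float(np.linalg.norm(U.T @ V) ** 2 / U.shape[1])

def hessian_overlap_trace(steps=300, lr=0.1, lam=1e-2, k=2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = 400, 6
    # Two high-variance features give the Hessian a few dominant directions.
    X = rng.standard_normal((n, d)) * np.array([4.0, 3.0, 0.5, 0.5, 0.5, 0.5])
    y = (rng.random(n) < sigmoid(X @ rng.standard_normal(d))).astype(float)
    w = np.zeros(d)
    U_prev = top_eigenspace(logistic_hessian(X, w, lam), k)
    overlaps = []
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / n + lam * w
        w = w - lr * grad
        U = top_eigenspace(logistic_hessian(X, w, lam), k)
        overlaps.append(subspace_overlap(U_prev, U))
        U_prev = U
    return np.array(overlaps)

overlaps = hessian_overlap_trace()
```

As the iterates settle, consecutive Hessians differ less and the overlap between their leading eigenspaces approaches 1. This is the qualitative behavior the theorem describes, shown here only for a convex toy problem.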

Figures (8)

  • Figure 1: Hessian gap across layers and components of LLaMA2-7B. We observe that: 1) mlp.up_proj, mlp.down_proj, and self_attn.v_proj exhibit less pronounced Hessian gaps compared to self_attn.k_proj, self_attn.q_proj, self_attn.o_proj, and mlp.gate_proj; 2) early and late layers generally display clearer gaps than middle layers; 3) components with a pronounced Hessian gap (self_attn.k_proj, self_attn.q_proj, self_attn.o_proj, and mlp.gate_proj) tend to be more low-rank, as shown in our experiments.
  • Figure 2: (Row 1) Gradient subspace similarity across checkpoints during pretraining of LLaMA-130M on the C4 dataset for 25,000 training steps with the Adam optimizer. (Row 2) Emergence of a low-rank weight subspace during pretraining of LLaMA-130M. Each row of an individual subplot shows the singular values of the weights at a given training step.
  • Figure 3: Normalized singular values of the weight matrices for different layers of the LLaMa-2 7B pretrained checkpoint from HuggingFace. Each subplot shows the 4096 sorted, normalized singular values of one layer type (e.g., self_attn.q_proj) across the 32 transformer blocks.
  • Figure 4: Finetuning statistics and performance comparison of Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) of a 50% compressed LLaMa-2 7B model with C4. All finetuning hyperparameters are kept the same in both settings for a fair comparison.
  • Figure 5: Continual finetuning statistics and performance comparison of a 50% low-rank compressed LLaMa-2 7B pretrained checkpoint. With exactly the same hyperparameter configurations, WeLore-PEFT can outperform full finetuning with merely $\sim$35% of the trainable parameters, while providing $\sim$3$\times$ better throughput.
  • ...and 3 more figures
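
The gradient subspace similarity plotted in Figure 2 can be approximated with a short NumPy sketch. It assumes the subspace of a gradient matrix is spanned by its top-r left singular vectors and that similarity is the mean squared cosine of the principal angles between two such subspaces; the paper's exact definition may differ. The function names and the toy gradients are hypothetical.

```python
import numpy as np

def top_left_subspace(G, r):
    """Orthonormal basis for the span of the top-r left singular vectors of G."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]

def gradient_subspace_similarity(G_a, G_b, r):
    """Value in [0, 1]: 1 for identical rank-r subspaces, 0 for orthogonal ones."""
    U_a = top_left_subspace(G_a, r)
    U_b = top_left_subspace(G_b, r)
    return float(np.linalg.norm(U_a.T @ U_b) ** 2 / r)

def similarity_matrix(grads, r):
    """Pairwise similarity between gradients saved at different checkpoints."""
    k = len(grads)
    S = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            S[i, j] = S[j, i] = gradient_subspace_similarity(grads[i], grads[j], r)
    return S

# Toy "checkpoint gradients": G1 and G2 share a column space, G3 is orthogonal.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((32, 8)))
G1 = Q[:, :4] @ rng.standard_normal((4, 16))
G2 = Q[:, :4] @ rng.standard_normal((4, 16))
G3 = Q[:, 4:] @ rng.standard_normal((4, 16))
S = similarity_matrix([G1, G2, G3], r=4)
```

Under these assumptions, a block of entries near 1 in `S` for late checkpoints would indicate that the gradient subspace has stopped rotating, which is the stabilization that Row 1 of Figure 2 visualizes.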

Theorems & Definitions (4)

  • Theorem 2.1
  • Proof: Proof Sketch
  • Theorem 2.2
  • Proof: Proof Sketch