From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications
Ajay Jaiswal, Yifan Wang, Lu Yin, Shiwei Liu, Runjin Chen, Jiawei Zhao, Ananth Grama, Yuandong Tian, Zhangyang Wang
TL;DR
This work investigates why LLM weight matrices exhibit non-uniform low-rank structure by framing weight evolution through gradient-subspace stabilization detected via Hessian eigenspace analysis. It establishes a theoretical link between gradient dynamics and Hessian structure, identifying layer- and component-specific Hessian gaps and correlating them with low-rank emergence. The authors introduce WeLore, a data-agnostic, layer-adaptive framework that unifies low-rank compression (WeLore-COMP) and parameter-efficient finetuning (WeLore-PEFT) by differentiating Low-rank Components (LRCs) from Non-Low-rank Components (N-LRCs). Empirical results across LLaMa-2 and Mistral models demonstrate substantial memory and compute savings with minimal to no loss in performance, often surpassing full-finetuning or LoRA-style baselines on downstream tasks. The work advances practical green-AI strategies for efficient training and deployment of large transformers by leveraging non-uniform low-rank structures grounded in gradient-Hessian dynamics.
Abstract
The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank form, with the potential to relax memory and compute resource requirements. Unlike prior efforts that focus on developing novel matrix decompositions, in this work we study the non-uniform low-rank properties of weight matrices in LLMs through the lens of stabilizing gradient subspaces. First, we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. Second, we empirically establish an important relationship between gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating variable rank reduction across them to minimize the drop in performance due to compression. Drawing on this result, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a data-agnostic and one-shot manner. When used as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) and suitably encodes them for minimum performance loss. Our gradient dynamics perspective illustrates that LRCs tend to have better fine-tuning capabilities, and that fine-tuning them alone can closely mimic, and sometimes outperform, the training loss trajectory and performance of full fine-tuning with a notable reduction in memory and compute footprint. Code is available at https://github.com/VITA-Group/WeLore.
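To make the LRC / N-LRC distinction concrete, the following is a minimal sketch of how a weight matrix might be categorized and then encoded in low-rank form. It is not taken from the WeLore repository: the function names, the normalized singular-value threshold of 0.1, and the rank-ratio cutoff of 0.5 are all assumptions chosen for illustration, and the criterion used in the paper may differ.

```python
import numpy as np


def effective_rank(weight, threshold=0.1):
    """Count singular values whose size relative to the largest exceeds `threshold`.

    Hypothetical criterion, assumed for illustration only.
    """
    s = np.linalg.svd(weight, compute_uv=False)  # sorted in descending order
    if s[0] == 0.0:
        return 0
    return int(np.sum(s / s[0] > threshold))


def is_low_rank_component(weight, threshold=0.1, max_rank_ratio=0.5):
    """Label a matrix as an LRC when its effective rank is a small fraction
    of its full rank. `max_rank_ratio` is an assumed cutoff."""
    full_rank = min(weight.shape)
    return effective_rank(weight, threshold) / full_rank < max_rank_ratio


def low_rank_factors(weight, rank):
    """Return factors (A, B) with A @ B equal to the best rank-`rank`
    approximation of `weight` in the Frobenius norm (truncated SVD)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


rng = np.random.default_rng(0)
# Synthetic stand-ins for real layer weights:
low = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))  # rank 4
dense = rng.standard_normal((64, 64))                              # near full rank

for name, w in [("low", low), ("dense", dense)]:
    label = "LRC" if is_low_rank_component(w) else "N-LRC"
    print(name, "effective rank:", effective_rank(w), "->", label)

# An LRC can be stored as two thin factors instead of one full matrix.
a, b = low_rank_factors(low, effective_rank(low))
print("factor shapes:", a.shape, b.shape)
print("parameters:", low.size, "->", a.size + b.size)
```

Under these assumptions, a matrix labeled N-LRC would be left in its original dense form, while an LRC would be replaced by its two factors, which stores r(m + n) values instead of mn for an m × n matrix with retained rank r.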
