Gradient Weight-normalized Low-rank Projection for Efficient LLM Training

Jia-Hong Huang; Yixian Shen; Hongyi Zhu; Stevan Rudinac; Evangelos Kanoulas

TL;DR

This work addresses the escalating resource demands of full fine-tuning for large language models by introducing GradNormLoRP, a method that combines weight-vector normalization, low-rank weight updates, and gradient projection to stabilize training and reduce memory usage. The approach reparameterizes each weight vector as $w = \delta \frac{v}{\|v\|}$, with a scalar magnitude $\delta$ and a direction vector $v$, applies the low-rank decomposition $\mathcal{W} = \mathcal{M} \frac{\mathcal{W}_0 + IJ}{\|\mathcal{W}_0 + IJ\|_c}$, and projects gradients via a compact SVD-based mechanism $\tilde{\mathcal{D}}_t = \mathcal{U}_t \eta_t(\mathcal{U}_t^\top \mathcal{D}_t \mathcal{V}_t) \mathcal{V}_t^\top$, so that learning takes place in a normalized low-dimensional subspace. The paper provides a theoretical guarantee (Theorem 1) that the gradient evolves toward a low-rank structure with high probability. Its experiments report optimizer memory reductions of up to 89.5%, 8-bit training of LLaMA-7B on consumer GPUs without model parallelism, and a higher average GLUE score than LoRA for RoBERTa-base at rank 8 (80.65 versus 79.23). These findings position GradNormLoRP as a viable, memory-efficient alternative for LLM pre-training and fine-tuning with no additional inference cost. Together, the theoretical result and the empirical gains suggest the method can make large-scale model training more accessible.
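The weight reparameterization in the TL;DR can be illustrated with a small sketch. It is not the authors' implementation: it assumes that $\|\cdot\|_c$ denotes the column-wise L2 norm and that $\mathcal{M}$ holds one magnitude per column, and the helper names (`matmul`, `column_norms`, `reparameterize`) are hypothetical. Plain Python lists are used so the example runs without third-party libraries.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_norms(w):
    """L2 norm of every column of w (assumed meaning of ||.||_c)."""
    return [math.sqrt(sum(x * x for x in col)) for col in zip(*w)]

def reparameterize(w0, i_mat, j_mat, magnitude):
    """Return M * (W0 + I J) / ||W0 + I J||_c, applied column by column.

    Assumes no column of W0 + I J is all zeros (otherwise the norm is 0).
    """
    update = matmul(i_mat, j_mat)
    direction = [[a + b for a, b in zip(r0, ru)] for r0, ru in zip(w0, update)]
    norms = column_norms(direction)
    return [[m * x / n for x, m, n in zip(row, magnitude, norms)] for row in direction]

# Frozen 3x2 weight W0 with rank-1 factors I (3x1) and J (1x2).
w0 = [[3.0, 0.0], [4.0, 1.0], [0.0, 0.0]]
i_mat = [[1.0], [0.0], [2.0]]

# With J = 0 and M set to the column norms of W0, the weight equals W0.
magnitude = column_norms(w0)
w_init = reparameterize(w0, i_mat, [[0.0, 0.0]], magnitude)

# After a low-rank update, each column norm is still fixed by M;
# only the column directions have changed.
w_new = reparameterize(w0, i_mat, [[0.5, -1.0]], magnitude)
```

Under these assumptions the low-rank factors only steer the direction of each column, while the magnitude is carried separately by $\mathcal{M}$.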

Abstract

Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges, particularly in the extensive utilization of full fine-tuning for downstream tasks. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning and struggle with memory efficiency. In this work, we introduce Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning. GradNormLoRP normalizes the weight matrix to improve gradient conditioning, facilitating better convergence during optimization. Additionally, it applies low-rank approximations to the weight and gradient matrices, significantly reducing memory usage during training. Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks. For instance, when fine-tuning the RoBERTa model on all GLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65, surpassing LoRA's score of 79.23. These results underscore GradNormLoRP as a promising alternative for efficient LLM pre-training and fine-tuning. Source code: https://github.com/Jhhuangkay/Gradient-Weight-normalized-Low-rank-Projection-for-Efficient-LLM-Training
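To give a feel for why projecting gradients into a low-rank subspace shrinks optimizer state, the following back-of-the-envelope count may help. It is an illustrative sketch under stated assumptions, not the accounting behind the reported 89.5% figure: it assumes an Adam-style optimizer with two moment buffers, a two-sided projection $\mathcal{U}_t^\top \mathcal{D}_t \mathcal{V}_t$ of size $r \times r$ as in the TL;DR formula, and a hypothetical $4096 \times 4096$ layer with rank 128. Quantization and all other memory costs are ignored.

```python
def adam_state_entries(m, n):
    """Full-rank Adam: a first and a second moment for each of the m*n weights."""
    return 2 * m * n

def projected_state_entries(m, n, r):
    """Entries kept when Adam runs on the r x r projected gradient U^T D V.

    Counts two r x r moment buffers plus the projection
    matrices U (m x r) and V (n x r).
    """
    return 2 * r * r + m * r + n * r

def saving(m, n, r):
    """Fraction of optimizer-state entries saved relative to full-rank Adam."""
    return 1.0 - projected_state_entries(m, n, r) / adam_state_entries(m, n)

# Hypothetical layer size and rank, chosen only for illustration.
full = adam_state_entries(4096, 4096)
low = projected_state_entries(4096, 4096, 128)
fraction_saved = saving(4096, 4096, 128)
```

The count depends heavily on the layer shape, the rank, and whether the projection is one-sided or two-sided, so it should be read as an order-of-magnitude illustration only.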

Paper Structure (16 sections, 69 equations, 3 figures, 7 tables, 1 algorithm)


Introduction
Related Work
Methodology
Background
Our Proposed GradNormLoRP
Experiments
Experimental Setup
Results and Analysis
Conclusion
Acknowledgements
Proof of Our Proposed Theorem 1
Proof of Lemma 1
Normalized Subspace
Asymptotic Analysis for Existing PEFT Methods with Big-O Notation
Proof of Theorem 4 (zhao2024galore)
...and 1 more sections

Figures (3)

Figure 1: From left to right, the figure illustrates a comparison of memory usage, the impact of varying subspace frequencies, and the effect of rank across steps.
Figure 2: The diagram shows gradient descent during LLM fine-tuning in different subspaces. Fine-tuning in unnormalized subspaces (top) leads to unstable and erratic convergence, which slows down training. In contrast, normalized subspaces (bottom) result in smoother and more stable convergence, improving training efficiency. Arrow lengths represent step sizes, and different colors show learning paths in different subspaces.
Figure 3: From left to right, the figure illustrates a comparison of memory usage, the impact of varying subspace frequencies, and the effect of rank across steps.
