GaLore$+$: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection

Xutao Liao, Shaohui Li, Yuhui Xu, Zhi Li, Yu Liu, You He

TL;DR

GaLore$+$ tackles the time bottleneck of SVD-based low-rank projection in LLM fine-tuning in two ways: cross-head projection, which shares projection matrices across attention heads to reduce complexity, and randomized subspace iteration for faster SVD. It further mitigates approximation errors through sparsely coded residuals applied to the optimizer moments and weight updates, with a warm-up phase to build an efficient sparse indexing scheme. Empirical results on arithmetic reasoning and natural language generation show that GaLore$+$ achieves superior performance while fine-tuning approximately 4× faster than vanilla GaLore. The approach thus offers a practical, memory-efficient path for adapting large LLMs with strong task performance and reduced compute requirements.

Abstract

Recent low-rank training methods, such as GaLore, have significantly reduced the memory required to optimize large language models (LLMs). However, these methods often suffer from time-consuming low-rank projection estimations. In particular, the singular value decomposition (SVD) in GaLore can consume more than 80\% of the total training time. To address this issue, we propose GaLore$+$, which uses cross-head low-rank projection to reduce the substantial time consumption in estimating low-rank projections for multi-head attention. In addition, we employ randomized subspace iteration to achieve fast SVD. To further enhance performance, we propose sparsely coded residuals to reduce the errors caused by low-rank approximation on the first- and second-order moments of the optimizers and weight updates. We evaluate GaLore$+$ on arithmetic reasoning and natural language generation datasets. Our experiments demonstrate that GaLore$+$ delivers superior performance while achieving approximately $4\times$ fine-tuning speed compared to vanilla GaLore.
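
The two speed-ups named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the gradient of an attention weight has shape (d_model, n_heads * head_dim) with heads laid out as contiguous column blocks, and that the projection is estimated from a single head and reused for all heads. The function names, the `head_index`, `n_iter`, and `oversample` parameters, and the head layout are all hypothetical choices made for illustration.

```python
import numpy as np


def randomized_subspace_iteration(grad, rank, n_iter=2, oversample=4, seed=0):
    """Approximate the top-`rank` left singular vectors of `grad`.

    Generic randomized subspace iteration; `n_iter` and `oversample`
    are illustrative defaults, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    m, n = grad.shape
    k = min(rank + oversample, m, n)
    # A random test matrix: its image under `grad` roughly spans
    # the dominant column space of `grad`.
    q, _ = np.linalg.qr(grad @ rng.standard_normal((n, k)))
    for _ in range(n_iter):
        # Alternate multiplications by grad.T and grad,
        # re-orthonormalizing after each one for numerical stability.
        z, _ = np.linalg.qr(grad.T @ q)
        q, _ = np.linalg.qr(grad @ z)
    # Exact SVD of the small (k x n) matrix instead of the full gradient.
    u_small, _, _ = np.linalg.svd(q.T @ grad, full_matrices=False)
    return (q @ u_small)[:, :rank]


def cross_head_project(grad, n_heads, rank, head_index=0):
    """Estimate a projection from one head and share it across all heads.

    Assumption: heads are contiguous column blocks of `grad`.
    Returns the (d_model x rank) projection and the (rank x d_out)
    low-rank gradient.
    """
    d_out = grad.shape[1]
    head_dim = d_out // n_heads
    start = head_index * head_dim
    head_grad = grad[:, start:start + head_dim]
    proj = randomized_subspace_iteration(head_grad, rank)
    # Because `proj` is shared, one matrix product projects every head.
    return proj, proj.T @ grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic gradient whose heads share a rank-4 column space.
    g = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 8 * 16))
    p, g_low = cross_head_project(g, n_heads=8, rank=4)
    rel_err = np.linalg.norm(p @ g_low - g) / np.linalg.norm(g)
    print("projection:", p.shape, "low-rank gradient:", g_low.shape)
    print("relative reconstruction error:", rel_err)
```

In this sketch the decomposition runs on one (d_model x head_dim) block rather than on the full gradient, which is where the time saving would come from. The sharing is only as accurate as the assumption that the heads have similar dominant subspaces.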

Paper Structure

This paper contains 19 sections, 27 equations, 4 figures, 8 tables, 2 algorithms.

Figures (4)

  • Figure 1: We compare the time consumption for fine-tuning LLaMA2-7B on different datasets with GaLore and GaLore$+$.
  • Figure 2: Motivations for cross-head low-rank projection. (a) illustrates the time consumption of SVD and other operations when fine-tuning a LLaMA2-7B model on different datasets with GaLore. MHA is short for multi-head attention. (b) presents the approximation errors of low-rank projection with cross-head projection (i.e., GaLore$+$) and conventional projection (i.e., GaLore).
  • Figure 3: Ablation study on the sparsely coded residual, when fine-tuning LLaMA2-7B.
  • Figure 4: Ablation study on the sparsely coded residual, when fine-tuning LLaMA3-8B. The results are obtained by adjusting the proportion of non-zero elements in the sparse indexing matrix.