Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang

TL;DR

Q-GaLore tackles the memory bottleneck of training large language models by combining quantization with adaptive low-rank gradient updates. It keeps weights in 8-bit and projection matrices in 4-bit precision, and uses stochastic rounding together with layer-wise lazy SVD updates to preserve training fidelity while reducing memory and compute. The approach achieves competitive pre-training and fine-tuning performance across model sizes, including training LLaMA-7B with as little as 16 GB of memory, which enables scenarios previously infeasible on consumer hardware. Empirical results show substantial memory savings (up to ~60% relative to GaLore and the Full baseline) and time savings from fewer SVD calls, without sacrificing performance against strong baselines.

Abstract

Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. At pre-training, Q-GaLore facilitates training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB memory. At fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
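
The role of stochastic rounding in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`stochastic_round`, `int8_update`), the per-tensor `scale`, and the update size are all assumptions chosen for illustration. The sketch shows why an update smaller than one quantization step, which round-to-nearest would discard every time, still moves an INT8 weight on average.

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    frac = x - lower  # probability of rounding up
    return lower + (1 if rng.random() < frac else 0)


def int8_update(q_weight: int, scale: float, update: float, rng: random.Random) -> int:
    """Apply a real-valued update to a weight stored as q_weight * scale (hypothetical layout)."""
    target = q_weight + update / scale  # desired weight, measured in quantization steps
    return max(-128, min(127, stochastic_round(target, rng)))


rng = random.Random(0)
scale = 0.01    # assumed quantization step size
update = 0.001  # one tenth of a step; round-to-nearest would always drop it
steps = 10_000

# Apply the same tiny update to a fresh zero weight many times and average the result.
moved = sum(int8_update(0, scale, update, rng) for _ in range(steps))
mean_step = moved / steps  # average movement per update, in quantization steps
print(f"average movement per update: {mean_step:.3f} steps")
```

Under these assumptions the average movement comes out close to 0.1 quantization steps per update, so the small gradient contributions accumulate in expectation instead of being lost.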

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison of data types and training flows of different methods. By default, we use 8-bit Adam [dettmers20218] as the inner optimizer. Note that the gradient in GaLore and Q-GaLore is not persistent during training, following the same strategy as [lv2023full, lv2023adalomo].
  • Figure 2: Cosine similarity between the adjacent projection matrices captured every 250 training iterations.
  • Figure 3: Pre-training performance on the LLaMA-130M models. The projection matrices are quantized with different bits.
  • Figure 4: Illustration of the training flows for Q-GaLore, where the dotted icon denotes intermediate tensors that do not consistently occupy memory.
  • Figure 5: Results of the memory allocation of training a LLaMA-7B model with a single batch size of 256.
  • ...and 2 more figures
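
The measurement behind Figure 2 (cosine similarity between adjacent projection matrices) suggests how a layer-adaptive update schedule could work. The following is a rough sketch only: the class name `LazySubspaceSchedule`, the 0.4 threshold, the patience of 3 checks, and the interval-doubling rule are assumptions made for illustration, not values or rules taken from the paper. Only the 250-iteration starting interval echoes the figure caption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two matrices (lists of rows), treated as flat vectors."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)


class LazySubspaceSchedule:
    """Hypothetical per-layer schedule: recompute the SVD less often once the subspace is stable."""

    def __init__(self, interval=250, threshold=0.4, patience=3):
        self.interval = interval    # iterations between SVD recomputations
        self.threshold = threshold  # similarity above which the subspace counts as stable
        self.patience = patience    # consecutive stable checks required before relaxing
        self.stable_checks = 0

    def observe(self, prev_proj, new_proj):
        """Compare adjacent projection matrices and return the updated interval."""
        if cosine_similarity(prev_proj, new_proj) >= self.threshold:
            self.stable_checks += 1
        else:
            self.stable_checks = 0
        if self.stable_checks >= self.patience:
            self.interval *= 2      # assumed rule: double the interval for a converged layer
            self.stable_checks = 0
        return self.interval


schedule = LazySubspaceSchedule()
unchanged = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(3):
    interval = schedule.observe(unchanged, unchanged)
print(interval)
```

In this sketch a layer whose projection matrix stops changing earns a longer interval between SVD calls, while a layer whose subspace keeps moving resets its counter and stays on the original schedule.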