COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

Jinqi Xiao, Shen Sang, Tiancheng Zhi, Jing Liu, Qing Yan, Yuqian Zhang, Linjie Luo, Bo Yuan

TL;DR

COAP tackles the memory bottleneck in training large-scale models by introducing a correlation-aware gradient projection that keeps the optimizer in a continuity-preserving low-rank subspace. By optimizing the projection with a gradient-reconstruction term and a directional cosine term, and by occasionally recalibrating with a low-cost SVD, COAP achieves substantial memory savings with minimal overhead while maintaining or improving performance across vision, language, and multimodal tasks. The approach yields notable gains, including $61\%$ optimizer-memory reduction on LLaMA-1B with $2\%$ extra time, up to $81\%$ with 8-bit optimization on LLaVA-7B, and up to $4\times$ speedups over GaLore, with improved perplexity and accuracy in several settings. These results demonstrate COAP’s practicality for large-scale training when combined with other memory-efficient techniques, enabling faster experimentation and deployment of large models.
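This summary names the two ingredients of the projection objective (a gradient-reconstruction term and a directional cosine term) but does not give the objective itself. The sketch below only computes quantities of those two kinds for a given projection matrix, to make the terms concrete. It is not COAP's loss: the function names, the relative Frobenius error, and the choice of comparing the gradient with its own low-rank reconstruction are assumptions made for illustration.

```python
import numpy as np

def reconstruction_error(G, P):
    """Relative Frobenius error of rebuilding G from its projection G P P^T.

    Hypothetical helper: G is an (m, n) gradient, P an (n, r) matrix
    with orthonormal columns.
    """
    G_hat = G @ P @ P.T
    return float(np.linalg.norm(G - G_hat) / np.linalg.norm(G))

def cosine_similarity(A, B, eps=1e-12):
    """Cosine of the angle between A and B, treated as flattened vectors."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B) + eps))

rng = np.random.default_rng(0)
m, n, r = 64, 32, 8
G = rng.standard_normal((m, n))

# Two candidate projections: a random orthonormal basis and the top-r
# right singular vectors of G (the SVD-based choice).
P_rand, _ = np.linalg.qr(rng.standard_normal((n, r)))
_, _, Vt = np.linalg.svd(G, full_matrices=False)
P_svd = Vt[:r].T

err_rand = reconstruction_error(G, P_rand)
err_svd = reconstruction_error(G, P_svd)
cos_svd = cosine_similarity(G, G @ P_svd @ P_svd.T)

print(f"reconstruction error: random={err_rand:.3f}, svd={err_svd:.3f}")
print(f"cosine(G, reconstruction) with svd projection: {cos_svd:.3f}")
```

For an orthonormal P the two quantities are tied together (squared error plus squared cosine equals 1), so a cosine term only adds information when it compares against something other than the current gradient, such as a moment estimate.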

Abstract

Training large-scale neural networks in vision and multimodal domains demands substantial memory resources, primarily due to the storage of optimizer states. While LoRA, a popular parameter-efficient method, reduces memory usage, it often suffers from suboptimal performance due to the constraints of low-rank updates. Low-rank gradient projection methods (e.g., GaLore, Flora) reduce optimizer memory by projecting gradients and moment estimates into low-rank spaces via singular value decomposition or random projection. However, they fail to account for inter-projection correlation, causing performance degradation, and their projection strategies often incur high computational costs. In this paper, we present COAP (Correlation-Aware Gradient Projection), a memory-efficient method that minimizes computational overhead while maintaining training performance. Evaluated across various vision, language, and multimodal tasks, COAP outperforms existing methods in both training speed and model performance. For LLaMA-1B, it reduces optimizer memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW. With 8-bit quantization, COAP cuts optimizer memory by 81% and achieves 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
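To illustrate where the optimizer-memory saving of low-rank gradient projection comes from, here is a minimal NumPy sketch of an Adam-style step whose moment estimates live in a rank-r subspace. It follows the generic SVD-based recipe the abstract attributes to prior methods, not COAP's correlation-aware update. The function names, shapes, and hyperparameter defaults are assumptions chosen for the example.

```python
import numpy as np

def svd_projection(G, r):
    """Top-r right singular vectors of G as an (n, r) orthonormal basis."""
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    return Vt[:r].T

def projected_adam_step(W, G, P, M, V, step,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with moments stored in the subspace spanned by P.

    W, G: (m, n) weight and gradient. P: (n, r) projection.
    M, V: (m, r) first and second moment estimates.
    """
    R = G @ P                                  # project gradient to (m, r)
    M = beta1 * M + (1 - beta1) * R            # low-rank first moment
    V = beta2 * V + (1 - beta2) * R**2         # low-rank second moment
    M_hat = M / (1 - beta1**step)              # bias correction
    V_hat = V / (1 - beta2**step)
    update = (M_hat / (np.sqrt(V_hat) + eps)) @ P.T   # back to (m, n)
    return W - lr * update, M, V

rng = np.random.default_rng(0)
m, n, r = 64, 32, 8
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))

P = svd_projection(G, r)
M = np.zeros((m, r))
V = np.zeros((m, r))
W_new, M, V = projected_adam_step(W, G, P, M, V, step=1)

# Optimizer-state element counts: full Adam keeps two (m, n) moments;
# the projected variant keeps two (m, r) moments plus the (n, r) projection.
full_state = 2 * m * n
low_rank_state = 2 * m * r + n * r
saving = 1 - low_rank_state / full_state
print(f"state elements: full={full_state}, low-rank={low_rank_state}, "
      f"saving={saving:.1%}")
```

With these toy shapes the state shrinks from 4096 to 1280 elements. In practice the saving depends on the rank chosen per layer, and recomputing P by a full SVD is the kind of cost the abstract describes as high.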

Paper Structure

This paper contains 28 sections, 1 theorem, 23 equations, 6 figures, 13 tables, 3 algorithms.

Key Result

Theorem 1.1

Assume that the gradient is Lipschitz with respect to the weight, with Lipschitz constant $L$. Assume $\kappa_1\doteq\frac{\sigma_1}{\sigma_{r+1}}>1$ and $\kappa_r\doteq\frac{\sigma_r}{\sigma_{r+1}}>1$, where $\kappa_1$ and $\kappa_r$ are the ratios of the first and the $r$-th singular values, respectively, to the $(r+1)$-th singular value of the gradient $\bm{G}$
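As a concrete reading of the two ratios in the assumption, the following sketch computes $\kappa_1$ and $\kappa_r$ from the singular values of a small matrix. The helper name and the toy matrix are made up for illustration, and the code covers only the definitions given in the statement, not the conclusion of the theorem.

```python
import numpy as np

def singular_value_ratios(G, r):
    """Return (kappa_1, kappa_r) = (sigma_1 / sigma_{r+1}, sigma_r / sigma_{r+1}).

    Hypothetical helper: G is a 2-D array and r must satisfy
    1 <= r < min(G.shape) so that sigma_{r+1} exists.
    """
    s = np.linalg.svd(G, compute_uv=False)   # singular values, descending
    if not 1 <= r < len(s):
        raise ValueError("r must satisfy 1 <= r < min(G.shape)")
    if s[r] == 0:
        raise ValueError("sigma_{r+1} is zero, so the ratios are undefined")
    # s is 0-indexed: s[0] = sigma_1, s[r-1] = sigma_r, s[r] = sigma_{r+1}.
    return s[0] / s[r], s[r - 1] / s[r]

# Toy gradient with singular values 8, 4, 2, 1.
G = np.diag([8.0, 4.0, 2.0, 1.0])
kappa_1, kappa_r = singular_value_ratios(G, r=2)
print(kappa_1, kappa_r)   # approximately 4.0 and 2.0
```

Because singular values are sorted in descending order, $\kappa_1 \ge \kappa_r$ always holds, and $\kappa_r > 1$ means there is a strict gap between the $r$-th and $(r+1)$-th singular values.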

Figures (6)

  • Figure 1: Comparison between COAP and other low-rank-based methods. The X-axis shows additional training time, with lower values being better. The Y-axis shows performance (e.g., FID, PPL) changes compared to the original optimizer (e.g., Adam (Kingma and Ba, 2014), Adafactor (Shazeer and Stern, 2018)), with higher values indicating better performance.
  • Figure 1: Comparison of Top-1 accuracy for ResNet under different low-rank projection formats at a rank ratio of 4. The models are trained for 100 epochs on the CIFAR-100 dataset.
  • Figure 2: Comparison of optimization trajectories between COAP and GaLore. $\bm{P}_{t-1}$ represents the low-rank projection space from the previous cycle, and $\bm{P}_t$ represents the current low-rank projection space. The symbol "$\star$" denotes the global optimum. GaLore updates $\bm{P}_t$ based on a batch of stochastic data, which can lead to suboptimal updates if the data significantly deviates from the overall data distribution. In contrast, our method uses the more stable first-order moment $\bm{M}_{t-1}$ as guidance, mitigating this issue.
  • Figure 3: Cumulative effective update (CEU) and Top-1 accuracy over 300 epochs for different optimization methods on the CIFAR-100 dataset using the DeiT-Base model trained from scratch. The rank of $\bm{P}_t$ is 192, the learning rate is $5\times10^{-5}$, and all models are trained on a single A100 GPU with a batch size of 256. The magnitude of CEU indicates the extent to which the optimizer influences the model.
  • Figure 4: Ablation study on hyper-parameters $\lambda$, $r$, and $T_u$ for DeiT-Base on CIFAR-100 over 300 epochs. Here, $\lambda=\rm{None}$ means that occasional low-cost SVD does not participate in updating the low-rank projection matrix.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1.1