COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
Jinqi Xiao, Shen Sang, Tiancheng Zhi, Jing Liu, Qing Yan, Yuqian Zhang, Linjie Luo, Bo Yuan
TL;DR
COAP tackles the optimizer-state memory bottleneck in training large-scale models with a correlation-aware gradient projection that keeps optimizer states in a low-rank subspace and preserves continuity between successive projections. The projection is optimized with a gradient-reconstruction term and a directional cosine term, and is occasionally recalibrated with a low-cost SVD, so COAP saves substantial memory with little overhead while maintaining or improving performance across vision, language, and multimodal tasks. Reported gains include a $61\%$ reduction in optimizer memory on LLaMA-1B at $2\%$ extra training time with the same perplexity as AdamW, and, with 8-bit quantization on LLaVA-v1.5-7B fine-tuning, an $81\%$ reduction in optimizer memory and a $4\times$ speedup over GaLore with higher accuracy. These results suggest that COAP is practical for large-scale training and can be combined with other memory-efficient techniques such as 8-bit quantization.
Abstract
Training large-scale neural networks in vision and multimodal domains demands substantial memory resources, primarily due to the storage of optimizer states. While LoRA, a popular parameter-efficient method, reduces memory usage, it often suffers from suboptimal performance due to the constraints of low-rank updates. Low-rank gradient projection methods (e.g., GaLore, Flora) reduce optimizer memory by projecting gradients and moment estimates into low-rank spaces via singular value decomposition or random projection. However, they fail to account for inter-projection correlation, which degrades performance, and their projection strategies often incur high computational costs. In this paper, we present COAP (Correlation-Aware Gradient Projection), a memory-efficient method that minimizes computational overhead while maintaining training performance. Evaluated across various vision, language, and multimodal tasks, COAP outperforms existing methods in both training speed and model performance. For LLaMA-1B, it reduces optimizer memory by 61% with only 2% additional time cost, achieving the same perplexity (PPL) as AdamW. With 8-bit quantization, COAP cuts optimizer memory by 81% and achieves a 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
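To make the idea of low-rank gradient projection concrete, the following is a minimal NumPy sketch of the general SVD-based family the abstract describes (GaLore-style), not COAP's own correlation-aware update. The function names (`top_r_projection`, `projected_adam_step`, `optimizer_state_floats`, `demo`), the toy quadratic loss, and all sizes and hyperparameters are assumptions chosen for illustration. The projection is computed once and kept fixed; periodic refreshes, and the handling of correlation between successive projections that COAP targets, are deliberately left out.

```python
import numpy as np


def top_r_projection(grad, rank):
    """Return a (cols, rank) matrix of the top right singular vectors of grad."""
    _, _, vt = np.linalg.svd(grad, full_matrices=False)
    return vt[:rank].T


def projected_adam_step(weight, grad, proj, m, v, step,
                        lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step whose moments m, v have shape (rows, rank)."""
    g_low = grad @ proj                          # project gradient: (rows, rank)
    m = beta1 * m + (1.0 - beta1) * g_low        # first moment, low-rank
    v = beta2 * v + (1.0 - beta2) * g_low ** 2   # second moment, low-rank
    m_hat = m / (1.0 - beta1 ** step)            # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    update_low = m_hat / (np.sqrt(v_hat) + eps)
    weight = weight - lr * (update_low @ proj.T)  # map update back to full size
    return weight, m, v


def optimizer_state_floats(rows, cols, rank=None):
    """Count stored optimizer floats for one (rows, cols) weight matrix."""
    if rank is None:
        return 2 * rows * cols                   # full Adam: two moments
    return 2 * rows * rank + cols * rank         # two low-rank moments + projection


def demo(rows=64, cols=32, rank=4, steps=20, seed=0):
    """Run a few projected steps on the toy loss 0.5 * ||W - T||^2."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((rows, cols))
    weight = np.zeros((rows, cols))

    def loss(w):
        return 0.5 * float(np.sum((w - target) ** 2))

    initial_loss = loss(weight)
    proj = top_r_projection(weight - target, rank)  # fixed projection (assumption)
    m = np.zeros((rows, rank))
    v = np.zeros((rows, rank))
    for step in range(1, steps + 1):
        grad = weight - target                   # gradient of the toy loss
        weight, m, v = projected_adam_step(weight, grad, proj, m, v, step)
    return initial_loss, loss(weight), m.shape


if __name__ == "__main__":
    print(demo())
    print(optimizer_state_floats(64, 32), optimizer_state_floats(64, 32, rank=4))
```

In this toy setting, a 64 x 32 weight matrix needs 4096 optimizer floats under full Adam and 640 under a rank-4 projection. These figures only illustrate where the savings come from and are unrelated to the percentages reported above, which depend on the models, ranks, and quantization used in the paper.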
