CE-LoRA: Computation-Efficient LoRA Fine-Tuning for Language Models
Guanduo Chen, Yutong He, Yipeng Hu, Kun Yuan, Binhang Yuan
TL;DR
CE-LoRA tackles the high compute cost of fine-tuning large language models by targeting the activation-gradient backward pass. It introduces Approximated Matrix Multiplication (AMM) and Double-LoRA to cut compute while preserving LoRA's memory benefits, and adds layer-wise adaptive sparsity to balance accuracy and efficiency. Theoretical analysis shows convergence at $O(1/\sqrt{T})$ under momentum SGD, and empirical results demonstrate up to $36.3\%$ end-to-end speedup with minimal accuracy loss on reasoning benchmarks. This work provides a practical, scalable approach for computation-efficient fine-tuning of large transformer models.
Abstract
Large Language Models (LLMs) demonstrate exceptional performance across various tasks but demand substantial computational resources, even for fine-tuning. Although Low-Rank Adaptation (LoRA) significantly alleviates memory consumption during fine-tuning, it does little to reduce computational cost. This paper identifies the computation of activation gradients as the primary bottleneck in LoRA's backward propagation and introduces the Computation-Efficient LoRA (CE-LoRA) algorithm, which enhances computational efficiency while preserving memory efficiency. CE-LoRA leverages two key techniques: Approximated Matrix Multiplication, which replaces dense multiplications of large, complete matrices with sparse multiplications involving only critical rows and columns, and Double-LoRA, which reduces error propagation in activation gradients. Theoretically, CE-LoRA converges at the same rate as LoRA, $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of iterations. Empirical evaluations confirm that CE-LoRA significantly reduces computational costs compared to LoRA without notable performance degradation.
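To make the idea of multiplying only "critical rows and columns" concrete, the following is a minimal NumPy sketch of a top-k approximated matrix product, as it might apply to an activation-gradient computation of the form (output gradient) x (frozen weight). It is an illustration under stated assumptions, not the paper's implementation: the function name `approx_matmul`, the `keep` parameter, and the norm-product importance score are all hypothetical choices made here, and the abstract does not specify how CE-LoRA selects its critical indices.

```python
import numpy as np

def approx_matmul(a, b, keep):
    """Approximate a @ b using only `keep` indices of the shared inner dimension.

    Assumption: inner index i is scored by ||a[:, i]|| * ||b[i, :]||, a common
    heuristic for column-row selection; the paper's criterion may differ.
    """
    scores = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=1)
    # Indices of the `keep` largest scores (order among them does not matter).
    idx = np.argpartition(scores, -keep)[-keep:]
    # Multiply only the selected columns of `a` with the matching rows of `b`.
    return a[:, idx] @ b[idx, :]

rng = np.random.default_rng(0)
grad_out = rng.standard_normal((8, 64))   # stand-in for the gradient w.r.t. a layer output
weight = rng.standard_normal((64, 32))    # stand-in for a frozen weight matrix

exact = grad_out @ weight
approx = approx_matmul(grad_out, weight, keep=16)   # uses 16 of 64 inner indices
full = approx_matmul(grad_out, weight, keep=64)     # keeping everything recovers the exact product

rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative error with 16/64 indices kept: {rel_err:.3f}")
```

With `keep` equal to the full inner dimension the result matches the dense product, and the multiplication cost scales roughly linearly in `keep`. Random Gaussian inputs like these are a worst case for such a scheme, since no index dominates; the approach pays off when a small set of rows and columns carries most of the product's magnitude.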
