MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation

Wei Shen, Zhang Yaxiang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen

TL;DR

MLorc tackles the memory bottleneck of fine-tuning large language models by compressing and reconstructing the momentum of matrix parameters using Randomized SVD, enabling full-parameter updates with reduced memory. The method preserves training dynamics more faithfully than gradient-based approaches and provides convergence guarantees for the Lion optimizer. Empirically, MLorc outperforms LoRA, GaLore, and LDAdamW across NLG and NLU tasks, closely matching or exceeding full fine-tuning at small ranks (e.g., r = 4) while maintaining competitive runtime and memory. This yields a practical pathway to memory-efficient, high-quality fine-tuning of large models and suggests potential extensions to pre-training and larger architectures.

Abstract

With the increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). The key idea of MLorc is to compress and reconstruct the momentum of matrix parameters during training to reduce memory consumption. Compared to LoRA, MLorc avoids enforcing a fixed-rank constraint on weight update matrices and thus enables full-parameter learning. Compared to GaLore, MLorc directly compresses the momentum rather than the gradients, thereby better preserving the training dynamics of full-parameter fine-tuning. We provide a theoretical guarantee for its convergence under mild assumptions. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning at small ranks (e.g., $r=4$), and generalizes well across different optimizers, all without compromising time or memory efficiency.
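To make the compress-and-reconstruct idea concrete, here is a minimal NumPy sketch of keeping a first-moment (momentum) matrix in low-rank form with a randomized SVD. It is an illustration under stated assumptions, not the paper's exact algorithm: the function names (`rsvd_compress`, `reconstruct`, `momentum_step`), the oversampling parameter, and the plain exponential-moving-average update are all hypothetical choices made for this example.

```python
import numpy as np

def rsvd_compress(M, rank, oversample=4, rng=None):
    """Randomized SVD: return (U, S, Vt) with M ~= U @ diag(S) @ Vt.

    `oversample` is an assumption of this sketch; it is used only while
    sketching the range and is discarded before the factors are stored.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = M.shape
    k = min(rank + oversample, m, n)
    omega = rng.standard_normal((n, k))   # random test matrix
    Q, _ = np.linalg.qr(M @ omega)        # orthonormal basis for the sketched range
    B = Q.T @ M                           # small k x n projected matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    r = min(rank, k)
    return (Q @ Ub)[:, :r], S[:r], Vt[:r]

def reconstruct(U, S, Vt):
    """Rebuild the dense matrix from its stored low-rank factors."""
    return (U * S) @ Vt

def momentum_step(factors, grad, beta=0.9, rank=4, rng=None):
    """One hypothetical step: reconstruct, apply an EMA update, recompress."""
    m_prev = reconstruct(*factors)
    m_new = beta * m_prev + (1.0 - beta) * grad
    return m_new, rsvd_compress(m_new, rank, rng=rng)
```

Between steps only the factors are kept, so for an $m \times n$ parameter the stored momentum costs roughly $r(m+n+1)$ numbers instead of $mn$; the dense matrix exists only transiently inside `momentum_step`.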

Paper Structure

This paper contains 24 sections, 4 theorems, 23 equations, 4 figures, 7 tables, 3 algorithms.

Key Result

Theorem 3.3

Under the smoothness and variance assumptions, applying the Lion algorithm with appropriate parameters yields a convergence bound expressed in terms of $\Delta=f(W_0)-\inf_{W}f(W)$ and $d=mn$.

Figures (4)

  • Figure 1: Ratio of top-8 singular values to total singular values for the gradient, first moment, and second moment during AdamW fine-tuning of RoBERTa-base on the STSB dataset.
  • Figure 2: Training loss of AdamW under different methods
  • Figure 3: Training loss of full Lion and Lion with MLorc
  • Figure 4: Ratio of top-8 singular values to total singular values for the gradient, first moment, and second moment during AdamW fine-tuning of RoBERTa-base on the CoLA, MRPC, RTE, and STSB datasets.
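
The quantity plotted in Figures 1 and 4 can be sketched as follows. This is one plausible reading of the captions, assuming the ratio is the sum of the largest $k$ singular values divided by the sum of all singular values; the function name `top_k_singular_ratio` is hypothetical.

```python
import numpy as np

def top_k_singular_ratio(M, k=8):
    """Share of the singular-value mass carried by the k largest singular values.

    Assumption of this sketch: "ratio of top-k to total" means a ratio of sums.
    """
    s = np.linalg.svd(M, compute_uv=False)  # singular values in descending order
    total = s.sum()
    return float(s[:k].sum() / total) if total > 0 else 0.0
```

A value close to 1 means the matrix is well approximated at rank $k$, which is the property a low-rank compression of that matrix relies on.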

Theorems & Definitions (7)

  • Theorem 3.3: informal
  • Lemma A.1: Approximation error bound of RSVD
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • proof