SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization

Kwangryeol Park, Seulki Lee

TL;DR

SMMF addresses the memory bottleneck of adaptive optimizers by introducing square-matricized momentum factorization, which enables NNMF-based compression of first and second momentum tensors of arbitrary rank. The method maps momentum tensors to near-square matrices, compresses them into two vectors, and uses a decompression→compression update cycle to preserve gradient information, achieving substantial memory savings with comparable performance. The authors provide a regret bound showing $O(\sqrt{T})$ convergence akin to Adam-based methods and demonstrate up to 96% memory reduction in CNN and Transformer tasks without sacrificing accuracy or perplexity. This work offers a practical route to memory-efficient optimization on limited hardware, expanding the feasibility of training large models on embedded or constrained devices.
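The decompression→compression cycle can be illustrated with a small sketch. It assumes the rank-1 non-negative factorization popularized by Adafactor, in which a non-negative matrix is stored as its row sums and column sums, and it covers only the second momentum. How SMMF handles signed first-momentum entries is not described in this summary and is left out. The function names and the `beta2` default are illustrative, not taken from the paper.

```python
def compress(matrix):
    """Store a non-negative matrix as (row sums, column sums): n + m numbers."""
    row = [sum(r) for r in matrix]
    col = [sum(c) for c in zip(*matrix)]
    return row, col


def decompress(row, col):
    """Rebuild the rank-1 approximation row * col^T / total."""
    total = sum(row)
    if total == 0:
        return [[0.0 for _ in col] for _ in row]
    return [[r * c / total for c in col] for r in row]


def update_second_momentum(row, col, grad, beta2=0.999):
    """One decompress -> update -> compress step (illustrative, not the paper's code)."""
    v_hat = decompress(row, col)  # temporary full-size matrix
    v_new = [
        [beta2 * v + (1.0 - beta2) * g * g for v, g in zip(v_row, g_row)]
        for v_row, g_row in zip(v_hat, grad)
    ]
    return compress(v_new)  # only the two vectors persist between steps


# A rank-1 matrix survives the round trip unchanged.
rank_one = [[1.0, 2.0], [2.0, 4.0]]
row, col = compress(rank_one)
restored = decompress(row, col)
```

Between steps only `row` and `col` are kept, so an n × m momentum matrix costs n + m stored values instead of n·m. The full matrix exists only transiently inside `update_second_momentum`.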

Abstract

We propose SMMF (Square-Matricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of widely used adaptive learning rate optimizers, such as Adam, by up to 96%. SMMF enables flexible and efficient factorization of first and second momentum tensors of arbitrary rank (shape) during optimization, based on the proposed square-matricization and one-time single matrix factorization. As a result, it is applicable to momentum tensors of any rank (shape), i.e., biases, matrices, and any rank-d tensors, which are prevalent in various deep model architectures, such as CNNs (high rank) and Transformers (low rank), in contrast to existing memory-efficient optimizers that apply only to a particular (rank-2) momentum tensor, e.g., linear layers. We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC, providing a theoretical basis for its competitive optimization capability. In our experiments, SMMF takes up to 96% less memory than state-of-the-art memory-efficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance on various CNN and Transformer tasks.

Paper Structure

This paper contains 29 sections, 17 theorems, 45 equations, 4 figures, 25 tables, 8 algorithms.

Key Result

Theorem 3.1

Given $n_r \in \mathbb{N}$ for $r \in [1, d]$ and a constant $N = \prod_{r=1}^d n_r$, the quantity $\left(\prod_{r=1}^{d-2} n_r\right)(n_{d-1} + n_d)$ decreases if both $n_{d-1}$ and $n_d$ increase (proof provided in the paper).
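Because $N$ is fixed, the quantity in Theorem 3.1 equals $N\left(1/n_{d-1} + 1/n_d\right)$, so it shrinks as the last two dimensions grow. That is the motivation for reshaping a tensor into a matrix that is as close to square as possible. The sketch below compares the two storage costs for one example shape. The divisor search (start at $\lfloor\sqrt{N}\rfloor$ and walk down to the first divisor of $N$) is an assumed, straightforward way to find a near-square shape, and the function names and the example shape are hypothetical rather than taken from the paper.

```python
import math


def near_square_shape(shape):
    """Return (rows, cols) with rows * cols == N and rows >= cols as close as possible."""
    n_total = math.prod(shape)
    cols = math.isqrt(n_total)
    while n_total % cols != 0:  # terminates at cols == 1 at the latest
        cols -= 1
    return n_total // cols, cols


def factored_size_last_two(shape):
    """Theorem 3.1 quantity: product of the leading dims times (n_{d-1} + n_d).

    Requires at least two dimensions.
    """
    *lead, a, b = shape
    return math.prod(lead) * (a + b)


conv_shape = (64, 32, 3, 3)  # hypothetical convolution weight shape
rows, cols = near_square_shape(conv_shape)

full_size = math.prod(conv_shape)            # 18432 values for the dense tensor
naive = factored_size_last_two(conv_shape)   # 12288 values when factoring the last two dims
square = rows + cols                         # 272 values after reshaping to 144 x 128
```

For this shape, factoring over the original last two dimensions saves little (12288 of 18432 values), whereas the near-square reshape needs only 272 values for the two factor vectors.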

Figures (4)

  • Figure 1: (Left) The validation top-1 accuracy of MobileNetV2 on ImageNet. (Right) The validation mAP50 of YOLOv5s on COCO of the five optimizers.
  • Figure 2: The test perplexity of the Transformer-base model on WMT32k during full-training steps (Left) and BERT on BookCorpus & Wikipedia during pre-training steps (Right).
  • Figure 3: (Left) The validation top-1 accuracy of MobileNetV2 on ImageNet. (Right) The test perplexity of the Transformer-base model on WMT32k during full-training steps.
  • Figure 4: The loss curve of LLaMA-2-7B fine-tuned with LoRA on the Alpaca dataset over 1000 steps. The blue line indicates Adam's performance and the orange line indicates SMMF's performance.

Theorems & Definitions (31)

  • Theorem 3.1
  • Corollary 3.1.1
  • Theorem 3.2
  • Theorem 4.1
  • proof
  • Lemma C.1
  • proof
  • proof
  • Theorem D.1
  • proof
  • ...and 21 more