SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization

Kwangryeol Park; Seulki Lee

TL;DR

SMMF addresses the memory bottleneck of adaptive optimizers by introducing square-matricized momentum factorization, which enables NNMF-based compression of first and second momentum tensors of arbitrary rank. The method reshapes each momentum tensor into a near-square matrix, compresses that matrix into two vectors, and uses a decompression→compression update cycle to fold each new gradient into the compressed state. The authors provide a regret bound showing $O(\sqrt{T})$ convergence akin to Adam-based methods and report up to 96% memory reduction on CNN and Transformer tasks with comparable accuracy and perplexity. This work offers a practical route to memory-efficient optimization on limited hardware, making it more feasible to train large models on embedded or otherwise constrained devices.
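The reshape, compress, and decompression→compression steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the rank-1 factorization via row and column sums is the Adafactor-style NNMF, assumed here because the summary does not spell out the exact factorization; the second-momentum rule is the usual Adam-style exponential moving average; and all function names (`near_square_shape`, `matricize`, `compress`, `decompress`, `update_second_moment`) are hypothetical. The handling of signed first-momentum values is left out.

```python
import math

def near_square_shape(numel):
    """Return (rows, cols) with rows * cols == numel and rows as close to cols as possible."""
    rows = math.isqrt(numel)
    while numel % rows != 0:  # walk down to the largest divisor <= sqrt(numel)
        rows -= 1
    return rows, numel // rows

def matricize(flat, rows, cols):
    """View a flat list of rows * cols values as a list of rows."""
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]

def compress(matrix):
    """Assumed rank-1 NNMF: keep only the row sums and column sums of a non-negative matrix."""
    row = [sum(r) for r in matrix]
    col = [sum(c) for c in zip(*matrix)]
    return row, col

def decompress(row, col):
    """Rebuild the rank-1 approximation row * col^T / sum(row)."""
    total = sum(row)
    if total == 0:
        return [[0.0] * len(col) for _ in row]
    return [[r * c / total for c in col] for r in row]

def update_second_moment(row, col, grad_flat, beta2=0.999):
    """One decompress -> update -> compress cycle for the second momentum."""
    grad = matricize(grad_flat, len(row), len(col))
    v = decompress(row, col)
    v = [[beta2 * vij + (1.0 - beta2) * g * g for vij, g in zip(vr, gr)]
         for vr, gr in zip(v, grad)]
    return compress(v)

# A hypothetical convolution weight of shape (64, 3, 3, 3) has 1728 elements.
print(near_square_shape(64 * 3 * 3 * 3))  # → (36, 48)
```

In this sketch only the two vectors persist between steps; the full matrix exists only temporarily inside `update_second_moment`.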

Abstract

We propose SMMF (Square-Matricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of widely used adaptive learning rate optimizers, such as Adam, by up to 96%. SMMF enables flexible and efficient factorization of first and second momentum tensors of arbitrary rank (shape) during optimization, based on the proposed square-matricization and one-time single matrix factorization. As a result, it is effectively applicable to momentum tensors of any rank (shape), i.e., biases, matrices, and any rank-d tensors, which are prevalent in various deep model architectures, such as CNNs (high rank) and Transformers (low rank). This contrasts with existing memory-efficient optimizers that apply only to a particular (rank-2) momentum tensor, e.g., that of a linear layer. We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC, providing a theoretical basis for its competitive optimization capability. In our experiments, SMMF takes up to 96% less memory compared to state-of-the-art memory-efficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance on various CNN and Transformer tasks.
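A short worked calculation suggests why a near-square reshape suits this kind of factorization. It assumes, as an illustration only, that a matrix of shape $\hat{n} \times \hat{m}$ is stored as one length-$\hat{n}$ vector and one length-$\hat{m}$ vector, and it ignores any auxiliary state the optimizer may keep. The symbols $N$, $\hat{n}$, and $\hat{m}$ are introduced here and are not taken from the abstract.

$$
\hat{n}\,\hat{m} = N
\quad\Longrightarrow\quad
\hat{n} + \hat{m} \;\ge\; 2\sqrt{\hat{n}\,\hat{m}} \;=\; 2\sqrt{N},
\qquad \text{with equality if and only if } \hat{n} = \hat{m}.
$$

Under that assumption, a bias with $N = 512$ elements treated as a $1 \times 512$ matrix would need $1 + 512 = 513$ stored values, which is more than the bias itself, whereas the near-square shape $16 \times 32$ needs only $16 + 32 = 48$.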
