Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices

Pengxiang Zhao; Ping Li; Yingjie Gu; Yi Zheng; Stephan Ludger Kölker; Zhefeng Wang; Xiaoming Yuan

TL;DR

Adapprox uses randomized low-rank matrix approximation to approximate Adam's second moment more effectively and accurately. It converges faster and improves downstream task performance relative to its counterparts.

Abstract

As deep learning models grow exponentially in size, optimizers such as Adam face significant memory consumption challenges because they store first and second moment data. Current memory-efficient methods such as Adafactor and CAME often compromise accuracy through their matrix factorization techniques. To address this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation for a more effective and accurate approximation of Adam's second moment. Adapprox features an adaptive rank selection mechanism that balances accuracy against memory efficiency, and it includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence. In GPT-2 training and downstream tasks, Adapprox achieves memory savings over AdamW of 34.5% to 49.9% for the 117M model and 33.8% to 49.9% for the 345M model with the first moment enabled, and saves more when the first moment is disabled. It also converges faster and improves downstream task performance relative to its counterparts.
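The idea of replacing a full second-moment matrix with low-rank factors can be illustrated with a minimal sketch. The code below is a generic randomized subspace iteration, not the paper's S-RSI algorithm, and it runs on a synthetic non-negative matrix rather than on real optimizer state. The function names, the matrix sizes, and the reading of `l` as an oversampling amount and `p` as an iteration count are assumptions made for illustration. The rank-1 baseline uses the row-sum and column-sum factorization associated with Adafactor.

```python
import numpy as np

def randomized_low_rank(V, k, l=5, p=5, seed=0):
    """Return (left, right) with V ~= left @ right, left: m x k, right: k x n.

    Generic randomized subspace iteration (illustrative, hypothetical helper).
    """
    rng = np.random.default_rng(seed)
    n = V.shape[1]
    omega = rng.standard_normal((n, k + l))      # random test matrix
    Q, _ = np.linalg.qr(V @ omega)               # orthonormal basis for the sampled range
    for _ in range(p):                           # subspace iterations sharpen the basis
        Z, _ = np.linalg.qr(V.T @ Q)
        Q, _ = np.linalg.qr(V @ Z)
    B = Q.T @ V                                  # small (k + l) x n projected matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    left = Q @ (U[:, :k] * s[:k])                # truncate to rank k
    right = Vt[:k]
    return left, right

def adafactor_rank1(V):
    """Rank-1 estimate from row sums and column sums of a non-negative matrix."""
    row = V.sum(axis=1, keepdims=True)
    col = V.sum(axis=0, keepdims=True)
    return row @ col / V.sum()

def rel_error(V, approx):
    """Relative Frobenius-norm error."""
    return np.linalg.norm(V - approx) / np.linalg.norm(V)

# Synthetic stand-in for a second-moment matrix: non-negative, exact rank 4.
rng = np.random.default_rng(1)
m, n, k = 64, 48, 4
V = rng.random((m, k)) @ rng.random((k, n))

left, right = randomized_low_rank(V, k)
err_lr = rel_error(V, left @ right)
err_af = rel_error(V, adafactor_rank1(V))

stored_full = m * n          # entries kept when the full matrix is stored
stored_lr = k * (m + n)      # entries kept by the rank-k factors
print(f"rank-{k} error: {err_lr:.2e}, rank-1 error: {err_af:.2e}")
print(f"stored entries: full={stored_full}, rank-{k}={stored_lr}")
```

Because the synthetic matrix has exact rank 4, the rank-4 factors recover it almost exactly, while the rank-1 estimate cannot. Real second-moment matrices are only approximately low rank, so the gap there depends on how fast the singular values decay.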

Paper Structure (18 sections, 1 theorem, 19 equations, 6 figures, 3 tables, 3 algorithms)

Introduction
Related Work
Methodology
Overview of the Adam Optimizer
Low-Rank Approximation of the Second Moment
Adaptive Rank Selection
Adapprox Algorithm
Cosine-Similarity Guidance Strategy
Experiments
Setup
Memory Usage Comparison
GPT-2 Training
Downstream Tasks
Discussion
Conclusion
...and 3 more sections

Key Result

Proposition 3.1

Given a set of randomly generated vectors $\{u_i\}_{i=1}^k$ that are in general linear position, and a full-rank matrix $A \in \mathbb{R}^{m \times n}$, the set of vectors $\{q_i \mid q_i = Au_i\}_{i=1}^k$ is also linearly independent.
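A quick numerical check can make the proposition concrete. The sketch below draws Gaussian vectors $u_i$ and a Gaussian matrix $A$, then compares matrix ranks. It assumes $k \le \min(m, n)$, which the statement above does not spell out, and the sizes and variable names are illustrative only. It is a sanity check on one random instance, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 12, 8, 5                    # assumed sizes with k <= min(m, n)

A = rng.standard_normal((m, n))       # full rank with probability 1
U = rng.standard_normal((n, k))       # columns are the random vectors u_i
Q = A @ U                             # columns are q_i = A u_i

rank_A = np.linalg.matrix_rank(A)
rank_U = np.linalg.matrix_rank(U)
rank_Q = np.linalg.matrix_rank(Q)

# If rank_Q equals k, the k vectors q_i are linearly independent.
print(rank_A, rank_U, rank_Q)
```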

Figures (6)

Figure 1: Singular value distributions. This figure shows the top 60 singular values from six second moment matrices, out of a full rank of 1,024, obtained from AdamW training a GPT-2 345M model at the 45,000th iteration.
Figure 2: Comparative analysis of S-RSI ($l=5$ and $p=5$) against Adafactor and SVD. All methods are applied to the second-moment matrices obtained from training a GPT-2 345M model with AdamW, with results captured at various stages of the training process.
Figure 3: Comparative analysis of Adapprox against AdamW, Adafactor, and CAME on pretraining GPT-2 models.
Figure 4: Comparative analysis of training loss for the GPT-2 345M model utilizing Adapprox with and without the clipping mechanism.
Figure 5: Accuracy of the AdamW-pretrained GPT-2 345M model fine-tuned with compared optimizers on the CoLA task across different learning rates.
...and 1 more figure

Theorems & Definitions (2)

Proposition 3.1
proof
