Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices

Pengxiang Zhao, Ping Li, Yingjie Gu, Yi Zheng, Stephan Ludger Kölker, Zhefeng Wang, Xiaoming Yuan

TL;DR

Adapprox uses randomized low-rank matrix approximation to approximate Adam's second moment more effectively and accurately. It also converges faster and performs better on downstream tasks than its counterparts.

Abstract

As deep learning models grow exponentially in size, optimizers such as Adam face significant memory consumption challenges because they store first and second moment data. Current memory-efficient methods such as Adafactor and CAME often compromise accuracy with their matrix factorization techniques. To address this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation for a more effective and accurate approximation of Adam's second moment. Adapprox features an adaptive rank selection mechanism that balances accuracy and memory efficiency, and includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence. In GPT-2 training and downstream tasks, Adapprox achieves memory savings over AdamW of 34.5% to 49.9% for the 117M model and 33.8% to 49.9% for the 345M model with the first moment enabled, and saves more still without the first moment. It also converges faster and improves downstream task performance relative to its counterparts.
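The abstract's core idea, replacing a full second-moment matrix with low-rank factors found by a randomized method, can be illustrated with a generic randomized subspace iteration. This is a minimal sketch under stated assumptions, not the paper's S-RSI algorithm: the function name `randomized_low_rank`, the parameters `oversample` and `power_iters`, and the synthetic rank-8 matrix are all hypothetical. The defaults of 5 only echo the $l=5$ and $p=5$ setting quoted in the Figure 2 caption; which symbol maps to which parameter is a guess.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=5, power_iters=5, seed=0):
    """Return (left, right) with A ≈ left @ right, left: m x k, right: k x n.

    Generic randomized subspace iteration; an illustration only,
    not necessarily the S-RSI variant used by Adapprox.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    width = min(k + oversample, m, n)
    # A Gaussian test matrix: its image under A samples the range of A.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, width)))
    for _ in range(power_iters):
        # Alternate between A^T and A, re-orthonormalizing for stability.
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)
    # Project onto the sampled subspace and truncate to rank k.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    left = (Q @ Ub[:, :k]) * s[:k]
    right = Vt[:k]
    return left, right

# Hypothetical test matrix of exact rank 8 (not data from the paper).
rng = np.random.default_rng(1)
V = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
left, right = randomized_low_rank(V, k=8)
rel_err = np.linalg.norm(V - left @ right) / np.linalg.norm(V)
stored = left.size + right.size
print(f"relative error: {rel_err:.2e}")
print(f"stored entries: {stored} vs {V.size} for the full matrix")
```

Storing the two factors takes $k(m+n)$ entries instead of $mn$, which is where the memory saving in a scheme of this kind would come from. How Adapprox actually stores its factors and chooses $k$ adaptively is not described in this summary.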

Paper Structure

This paper contains 18 sections, 1 theorem, 19 equations, 6 figures, 3 tables, and 3 algorithms.

Key Result

Proposition 3.1

Given a set of randomly generated vectors $\{u_i\}_{i=1}^k$ in general linear position and a full-rank matrix $A \in \mathbb{R}^{m \times n}$, the vectors $\{q_i \mid q_i = Au_i\}_{i=1}^k$ are also linearly independent.
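A quick numerical sanity check of the proposition is sketched below. It is an illustration, not a proof. It assumes Gaussian random vectors as the "randomly generated" $u_i$, a Gaussian $A$ (full rank with probability 1), and hypothetical sizes with $k \le \min(m, n)$; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 12, 9, 4                   # hypothetical sizes, k <= min(m, n)

A = rng.standard_normal((m, n))      # Gaussian matrix, full rank almost surely
U = rng.standard_normal((n, k))      # columns u_1..u_k, randomly generated
Q = A @ U                            # columns q_i = A u_i

rank_A = np.linalg.matrix_rank(A)
rank_U = np.linalg.matrix_rank(U)
rank_Q = np.linalg.matrix_rank(Q)

# If the q_i are linearly independent, Q has rank k.
print(f"rank(A)={rank_A}, rank(U)={rank_U}, rank(Q)={rank_Q}, k={k}")
```

A rank of $k$ for the matrix whose columns are the $q_i$ is what the proposition predicts in this setting.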

Figures (6)

  • Figure 1: Singular value distributions. This figure shows the top 60 singular values from six second-moment matrices, out of a full rank of 1,024, obtained from AdamW training a GPT-2 345M model at the 45,000th iteration.
  • Figure 2: Comparative analysis of S-RSI ($l=5$ and $p=5$) against Adafactor and SVD. All methods are applied to the second-moment matrices derived from training a GPT-2 345M model with AdamW, with results captured at various stages of the training process.
  • Figure 3: Comparative analysis of Adapprox against AdamW, Adafactor, and CAME on pretraining GPT-2 models.
  • Figure 4: Comparative analysis of training loss for the GPT-2 345M model utilizing Adapprox with and without the clipping mechanism.
  • Figure 5: Accuracy of the AdamW-pretrained GPT-2 345M model fine-tuned with compared optimizers on the CoLA task across different learning rates.
  • ...and 1 more figures
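To give a feel for the kind of comparison the Figure 2 caption describes, the sketch below contrasts an Adafactor-style rank-1 factorization (outer product of row sums and column sums, divided by the total sum) with truncated SVD on a synthetic nonnegative matrix. Everything here is an assumption for illustration: the matrix `V` is random stand-in data, the helper names are hypothetical, and the numbers say nothing about the paper's measurements.

```python
import numpy as np

def adafactor_rank1(V):
    """Rank-1 estimate of a nonnegative matrix from its row and column sums."""
    row = V.sum(axis=1, keepdims=True)    # m x 1
    col = V.sum(axis=0, keepdims=True)    # 1 x n
    return (row @ col) / V.sum()

def truncated_svd(V, k):
    """Best rank-k approximation in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def rel_err(V, approx):
    return np.linalg.norm(V - approx) / np.linalg.norm(V)

# Synthetic stand-in: elementwise squares of a low-rank "gradient" matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
V = G ** 2

errors = {
    "adafactor_rank1": rel_err(V, adafactor_rank1(V)),
    "svd_rank1": rel_err(V, truncated_svd(V, 1)),
    "svd_rank8": rel_err(V, truncated_svd(V, 8)),
}
for name, err in errors.items():
    print(f"{name}: {err:.4f}")
```

Because truncated SVD is optimal in Frobenius norm for a given rank, its rank-1 error can never exceed that of the Adafactor-style estimate, and raising the rank can only reduce the error further.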

Theorems & Definitions (2)

  • Proposition 3.1
  • proof