Grams: Gradient Descent with Adaptive Momentum Scaling

Yang Cao, Xiaoyu Li, Zhao Song

TL;DR

The authors theoretically demonstrate that Grams descends faster than other state-of-the-art optimizers, establish a global convergence guarantee for Grams, and highlight its potential as a transformative approach for efficiently training and fine-tuning large language models.

Abstract

We introduce $\mathbf{G}$radient Descent with $\mathbf{A}$daptive $\mathbf{M}$omentum $\mathbf{S}$caling ($\mathbf{Grams}$), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers. We theoretically demonstrate that Grams descends faster than other state-of-the-art optimizers and establish a global convergence guarantee for Grams. We also validate its effectiveness through extensive empirical evaluations. The results demonstrate Grams' superior performance, including faster convergence and better generalization, compared to widely used optimizers such as Adam, Lion, and their cautious variants. Our results highlight Grams' potential as a transformative approach for efficiently training and fine-tuning large language models. Code is available at https://github.com/Gunale0926/Grams.
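The decoupling described in the abstract can be illustrated with a minimal NumPy sketch. It is one reading of the abstract, not the authors' reference implementation (see the linked repository for that). It assumes that the magnitude comes from an Adam-style update with bias correction and that the direction is the elementwise sign of the current gradient. The function name `grams_step` and all argument names are hypothetical.

```python
import numpy as np


def grams_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical Grams-style step (t is the 1-based step count).

    Assumption: the magnitude is that of an Adam-style update, and the
    direction is the elementwise sign of the current gradient g.
    """
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)  # Adam-style update

    # Decoupling: keep only |u| from momentum, take the sign from g.
    step = np.sign(g) * np.abs(u)
    return w - lr * step, m, v
```

When the momentum and the current gradient disagree in sign on some coordinate, an Adam-style step follows the momentum on that coordinate, whereas this sketch still moves against the current gradient while reusing the momentum-based step size.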

Paper Structure

This paper contains 33 sections, 15 theorems, 67 equations, 1 figure, 8 tables, 1 algorithm.

Key Result

Lemma 3.9

Suppose that $\mathcal{L}:\mathbb{R}^d \to \mathbb{R}$ is $L$-smooth. Let $\Delta \mathcal{L}_{w_{t+1}^{\mathrm{C}}, w_t}$ be as defined in Definition def:delta_l, and let $w_{t+1}^{\mathrm{C}}$ be updated from $w_t$ using Definition def:cautious_update. Then we have the following:

Figures (1)

  • Figure 1: Convergence comparison on a simple convex function $f(w) := (0.5 w_1)^2 + (0.1 w_2)^2$. The learning rate is $\eta = 0.01$ for Grams, Adam, and C-Adam, and $\eta = 0.001$ for Lion and C-Lion. $\beta_1$ and $\beta_2$ are set to their default values for all optimizers. Left: optimization trajectories. Middle: distance between the current weights and the optimal weights. Right: training objective.
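The objective in Figure 1 is simple enough to write down directly. The sketch below defines the function, its analytic gradient, and the distance-to-optimum quantity from the middle panel. The optimum is at the origin because both terms are squares. The function names are hypothetical, and the starting point and number of steps used in the figure are not given here, so they are left out.

```python
import numpy as np


def f(w):
    """Toy convex objective from Figure 1: (0.5*w1)^2 + (0.1*w2)^2."""
    return (0.5 * w[0]) ** 2 + (0.1 * w[1]) ** 2


def grad_f(w):
    """Analytic gradient: d/dw1 = 0.5*w1, d/dw2 = 0.02*w2."""
    return np.array([0.5 * w[0], 0.02 * w[1]])


def distance_to_optimum(w):
    """Euclidean distance to the minimizer, which is the origin."""
    return float(np.linalg.norm(w))
```

The curvature differs by a factor of 25 between the two coordinates, which is what makes this small problem useful for comparing how optimizers scale their steps per coordinate.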

Theorems & Definitions (38)

  • Definition 3.1: $L$-smooth
  • Definition 3.3: PL-condition
  • Definition 3.4: Sign function
  • Definition 3.5: Adam
  • Definition 3.6: Lion Parameter Update
  • Definition 3.7: Cautious Mechanism Parameter Update
  • Definition 3.8
  • Lemma 3.9: Informal version of Lemma \ref{lem:delta_l_c}
  • Definition 4.1: Grams Parameter Update
  • Lemma 4.2: Informal version of Lemma \ref{lem:delta_l_grams}
  • ...and 28 more
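
For orientation, the first two entries in the list above are usually stated as follows in the optimization literature. This is the textbook form, with $\mu > 0$ and $\mathcal{L}^* := \min_w \mathcal{L}(w)$ as assumed notation. The paper's own Definitions 3.1 and 3.3 may use different symbols or constants.

```latex
% L-smoothness (textbook form): the gradient is L-Lipschitz.
\|\nabla \mathcal{L}(x) - \nabla \mathcal{L}(y)\|_2 \le L \,\|x - y\|_2
\quad \text{for all } x, y \in \mathbb{R}^d .

% Polyak-Lojasiewicz (PL) condition (textbook form), for some \mu > 0:
\tfrac{1}{2}\,\|\nabla \mathcal{L}(w)\|_2^2 \;\ge\; \mu \,\bigl(\mathcal{L}(w) - \mathcal{L}^*\bigr)
\quad \text{for all } w \in \mathbb{R}^d .
```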