Convergence Rate Analysis of LION

Yiming Dong, Huan Li, Zhouchen Lin

TL;DR

This paper demonstrates that LION converges to a Karush-Kuhn-Tucker (KKT) point at the rate of $\mathcal{O}(\sqrt{d}K^{-1/4})$ measured by the gradient $\ell_1$ norm, and empirically confirms that the gradient $\ell_1/\ell_2$ norm ratio aligns with $\Theta(\sqrt{d})$.
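
A short sketch of why the $\ell_1/\ell_2$ ratio matters, using only the standard norm inequalities. The constants $C$ and $c$, and the shorthand $g$ for the gradient at the iterate the bound applies to, are notation assumed here for illustration and are not taken from the paper:

$$
\|g\|_2 \;\le\; \|g\|_1 \;\le\; \sqrt{d}\,\|g\|_2 \qquad \text{for all } g \in \mathbb{R}^d,
$$

$$
\|g\|_1 \le C\sqrt{d}\,K^{-1/4} \quad\text{and}\quad \|g\|_1 \ge c\sqrt{d}\,\|g\|_2 \;\Longrightarrow\; \|g\|_2 \le \frac{C}{c}\,K^{-1/4}.
$$

Under these assumptions, a ratio at the upper end $\Theta(\sqrt{d})$ turns the $\ell_1$ rate into an $\ell_2$ rate with no dimension factor, which is the scale on which lower bounds in $K$ are usually stated.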

Abstract

The LION (evoLved sIgn mOmeNtum) optimizer for deep neural network training was found by Google via program search. Despite its simple sign update, it shows impressive performance in training large-scale networks. Although previous studies have investigated its convergence properties, a comprehensive analysis, especially of the convergence rate, is still desirable. Recognizing that LION can be regarded as solving a specific constrained problem, this paper demonstrates its convergence to a Karush-Kuhn-Tucker (KKT) point at the rate of $\mathcal{O}(\sqrt{d}K^{-1/4})$ measured by the gradient $\ell_1$ norm, where $d$ is the problem dimension and $K$ is the number of iteration steps. Going a step further, we remove the constraint and establish that LION converges to a critical point of the general unconstrained problem at the same rate. This rate not only delivers the currently optimal dependence on the problem dimension $d$ but also tightly matches, with respect to the number of iterations $K$, the theoretical lower bound for nonconvex stochastic optimization algorithms, which is typically measured using the gradient $\ell_2$ norm. Through extensive experiments, we not only demonstrate that LION achieves lower loss and higher performance compared to standard SGD, but also empirically confirm that the gradient $\ell_1/\ell_2$ norm ratio aligns with $\Theta(\sqrt{d})$, indicating that our convergence rate matches the theoretical lower bound with respect to $d$ in the empirical sense.
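
For readers unfamiliar with the "simple sign update" mentioned above, here is a minimal sketch of one LION step in plain Python. It follows the commonly published form of the update (interpolate momentum and gradient, take the sign, apply decoupled weight decay, then refresh the momentum). The function name `lion_step`, the argument names, and the default values are assumptions for illustration; the paper's own algorithm listing may use a different parametrization.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(theta, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One LION update on parameters given as flat lists of floats.

    Hypothetical helper: names and defaults are illustrative assumptions.
    Returns the new parameters and the new momentum buffer.
    """
    new_theta, new_momentum = [], []
    for t, g, m in zip(theta, grad, momentum):
        # Direction: sign of an interpolation between momentum and gradient.
        c = beta1 * m + (1.0 - beta1) * g
        # Every coordinate moves by exactly lr (plus decoupled weight decay).
        new_theta.append(t - lr * (sign(c) + weight_decay * t))
        # Momentum is refreshed with a separate coefficient beta2.
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_theta, new_momentum


# Usage: a few steps on f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta, m = [1.0, -2.0], [0.0, 0.0]
for _ in range(5):
    theta, m = lion_step(theta, theta, m, lr=0.1)
```

Because the step direction is a sign vector, each coordinate changes by the same magnitude `lr` regardless of the gradient scale, which is one reason the analysis is naturally stated in the gradient $\ell_1$ norm.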

Paper Structure

This paper contains 19 sections, 6 theorems, 35 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

$\bm{\theta}^\star$ is a KKT point of problem (constrainedproblem) iff

Figures (5)

  • Figure 1: Overview of results for ResNet18 [he2016deep], ResNet50 [he2016deep], and ViT-S [dosovitskiy2020image] models trained and evaluated on the CIFAR-100 [krizhevsky2009learning] dataset. Panels (a), (c), and (e) depict the training loss and Top-1 accuracy, and panels (b), (d), and (f) illustrate the gradient norm ratio.
  • Figure 2: Overview of results for BERT-Small and BERT-Base models trained and evaluated on the OpenWebText [Gokaslan2019OpenWeb] dataset. Panels (a) and (c) depict the training loss and test loss, while panels (b) and (d) illustrate the gradient norm ratio.
  • Figure 3: Overview of results for GPT-2 [radford2019language] Small and Medium models trained and evaluated on the OpenWebText [Gokaslan2019OpenWeb] dataset. Panels (a) and (c) depict the training loss and test loss, while panels (b) and (d) illustrate the gradient norm ratio.
  • Figure 4: Overview of results for ResNet18 [he2016deep], ResNet50 [he2016deep], and ViT-S [dosovitskiy2020image] models trained and evaluated on the CIFAR-10 [krizhevsky2009learning] dataset.
  • Figure 5: Overview of results for ResNet18 [he2016deep], ResNet50 [he2016deep], and ViT-S [dosovitskiy2020image] models trained and evaluated on the ImageNet-1K [ILSVRC15] dataset.
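
The gradient norm ratio shown in the figure panels can be computed from a flattened gradient with a sketch like the following. The helper names `l1_l2_ratio` and `normalized_ratio` are hypothetical, and how the paper aggregates gradients across layers or steps before plotting is not specified here, so treat this as an illustration of the quantity rather than the authors' measurement code.

```python
import math


def l1_l2_ratio(grad):
    """Return ||grad||_1 / ||grad||_2 for a flat list of floats."""
    l1 = sum(abs(g) for g in grad)
    l2 = math.sqrt(sum(g * g for g in grad))
    if l2 == 0.0:
        raise ValueError("ratio is undefined for a zero gradient")
    return l1 / l2


def normalized_ratio(grad):
    """Ratio divided by sqrt(d); it lies in [1/sqrt(d), 1] for nonzero input."""
    return l1_l2_ratio(grad) / math.sqrt(len(grad))


# A dense gradient with equal-magnitude entries attains the upper end sqrt(d);
# a one-hot gradient attains the lower end 1.
dense = [1.0, -1.0, 1.0, -1.0]
one_hot = [0.0, 3.0, 0.0, 0.0]
print(l1_l2_ratio(dense), l1_l2_ratio(one_hot))
```

A normalized ratio that stays bounded away from zero as $d$ grows is what "aligns with $\Theta(\sqrt{d})$" means in practice.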

Theorems & Definitions (10)

  • Lemma 1: Lemma 3.8 in [xie2024implicit] with $\ell_\infty$ norm
  • Theorem 2
  • Corollary 3
  • Theorem 4
  • Corollary 5
  • proof
  • Lemma 6
  • proof
  • proof
  • proof