Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts

Lizhang Chen, Bo Liu, Kaizhao Liang, Qiang Liu

TL;DR

This work provides the first principled Lyapunov-based theory for Lion, an optimizer discovered by symbolic search, by framing Lion-${\mathcal{K}}$ as a dynamics that minimizes a bound-constrained composite objective $F(x)=\alpha f(x)+\frac{\gamma}{\lambda}{\mathcal{K}}^*(\lambda x)$ under a convex constraint set. The authors develop a novel Lyapunov function and show a two-phase continuous-time behavior: Phase 1 quickly enforces the constraint via decay to $\mathrm{dom}{\mathcal{K}}^*$, and Phase 2 monotonically reduces the objective to a stationary point; they also provide a discrete-time analysis and extend results to stochastic gradients. By connecting Lion to a broad family of algorithms (including mirror descent and Frank–Wolfe) and demonstrating how different ${\mathcal{K}}$ yield varied constraints/regularizations, the paper offers a principled path to design new optimizers with theoretical guarantees. Empirical results on toy tasks and large-scale models (e.g., ImageNet and GPT-2) corroborate the theory, while suggesting architecture-dependent choices for ${\mathcal{K}}$ and confirming that a stronger bound can speed up convergence at some cost to final performance. This framework advances understanding of optimization dynamics in deep learning and opens avenues for tailored constrained optimizers with potential practical gains.

Abstract

Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, Polyak, and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-${\mathcal{K}}$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function ${\mathcal{K}}$, leading to the solution of a general composite optimization problem of $\min_x f(x) + {\mathcal{K}}^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.
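
To make the link between the sign operator and the bound constraint concrete, here is a short worked instance. It is a sketch based on standard convex-conjugate facts, not a quotation from the paper: choosing ${\mathcal{K}}$ as the $\ell_1$ norm gives $\text{sign}(\cdot)$ as its subgradient, and the conjugate is then the indicator of the unit $\ell_\infty$ ball, so evaluating it at $\lambda x$ is finite exactly on the box of radius $1/\lambda$.

```latex
\begin{aligned}
\mathcal{K}(x) &= \|x\|_1, \qquad \nabla\mathcal{K}(x) = \operatorname{sign}(x), \\
\mathcal{K}^*(y) &= \sup_{x}\,\bigl(x^\top y - \|x\|_1\bigr)
  = \begin{cases} 0, & \|y\|_\infty \le 1, \\ +\infty, & \text{otherwise}, \end{cases} \\
\mathcal{K}^*(\lambda x) < \infty &\iff \|\lambda x\|_\infty \le 1 \iff \|x\|_\infty \le 1/\lambda .
\end{aligned}
```

Under this choice, the ${\mathcal{K}}^*$ term acts as a hard constraint rather than a smooth penalty, which matches the bound constraint stated in the abstract.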

Paper Structure (51 sections, 18 theorems, 98 equations, 7 figures, 2 tables)

Key Result

Lemma 2.1

Assume ${\mathcal{K}}, {\mathcal{K}}^*$ are a closed convex conjugate pair and $\nabla{\mathcal{K}}$, $\nabla{\mathcal{K}}^*$ are their subgradients. Then $(\nabla{\mathcal{K}}(x) - \nabla{\mathcal{K}}(y))^\top (x-y) \geq 0$ and $(\nabla{\mathcal{K}}(x) - y)^\top (x - \nabla{\mathcal{K}}^*(y)) \geq 0$.
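
The two inequalities can be spot-checked numerically. The sketch below is an illustration only: the conjugate pair ${\mathcal{K}}(x) = \sum_i |x_i|^p/p$ and ${\mathcal{K}}^*(y) = \sum_i |y_i|^q/q$ with $1/p + 1/q = 1$ is an assumed example chosen because both gradients have closed forms, and the names `grad_K`, `grad_K_star`, `worst_first`, and `worst_second` are hypothetical.

```python
import numpy as np

# Assumed example pair: K(x) = sum |x_i|^p / p and K*(y) = sum |y_i|^q / q,
# which are convex conjugates when 1/p + 1/q = 1.
p, q = 3.0, 1.5

def grad_K(x):
    # Gradient of K(x) = sum_i |x_i|^p / p.
    return np.sign(x) * np.abs(x) ** (p - 1.0)

def grad_K_star(y):
    # Gradient of the conjugate K*(y) = sum_i |y_i|^q / q.
    return np.sign(y) * np.abs(y) ** (q - 1.0)

rng = np.random.default_rng(0)
worst_first = np.inf   # smallest value seen for the first inequality
worst_second = np.inf  # smallest value seen for the second inequality
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    first = np.dot(grad_K(x) - grad_K(y), x - y)
    second = np.dot(grad_K(x) - y, x - grad_K_star(y))
    worst_first = min(worst_first, first)
    worst_second = min(worst_second, second)

print("min of first inequality LHS: ", worst_first)
print("min of second inequality LHS:", worst_second)
```

Both minima should be nonnegative up to floating-point rounding. A random check like this is a sanity test of the statement, not a substitute for the proof.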

Figures (7)

  • Figure 1: (a)-(c) Trajectories of Lion on the 2D function $f(x) = (x_1 - 1.5)^2 + x_2^2$, with $\lambda = 1.5$ and $\lambda = 0.5$. The boxes in (a) represent the constraint set: the blue box is $\left\lVert x\right\rVert_\infty \leq 1/\lambda$ with $\lambda = 0.5$, and the green box is for $\lambda = 1.5$. (d) $\lambda$ vs. the converged loss. The converged loss starts to increase only when $\lambda$ exceeds a threshold ($\lambda\geq 0.6$), at which point the unconstrained minimum is excluded from the constraint set.
  • Figure 2: Histograms of the network parameters of ResNet-18 on CIFAR-10 trained by Lion with $\lambda = 10$. The constraint of $\left\lVert x\right\rVert_\infty\leq 1/\lambda$ (indicated by the red vertical lines) is satisfied within only $\sim$200 steps.
  • Figure 3: Evolution of the histogram of parameter weights of ResNet-18 trained by Lion on CIFAR-10 [he2016deep, krizhevsky2009learning] across iterations, with different $\lambda$ and initialization methods. (a): Kaiming uniform initialization [he2015delving] and $\lambda = 20$. (b): Kaiming normal initialization [he2015delving] and $\lambda = 20$. (c): Kaiming uniform initialization [he2015delving] and $\lambda = 0$. (d): Kaiming normal initialization [he2015delving] and $\lambda = 0$. The weights are quickly confined to the bound $[-0.05, 0.05]$ with $\lambda = 20$, while they keep growing with zero weight decay ($\lambda = 0$).
  • Figure 4: Analysis of weight decay on CIFAR-10 using Lion. (a) The converged loss vs. weight decay in Lion. The loss starts to increase only when $\lambda$ exceeds a threshold, which is expected from the constrained optimization view. (b) The loss curves vs. epochs with different weight decays. Larger weight decay $\lambda$ yields faster convergence (due to a stronger Phase 1), but may yield a larger final loss when it is too large.
  • Figure 5: The behavior of Lion-${\mathcal{K}}$ with different choices of ${\mathcal{K}}$ from Table \ref{tab:phiexamples}. The blue trajectory always reaches the optimum as the optimum is included in the constraint. The green trajectory converges to the boundary of the constraint.
  • ...and 2 more figures
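
The toy setting of Figure 1 can be reproduced in a few lines. The following is a minimal sketch, assuming the commonly published Lion update (sign of an interpolated momentum plus decoupled weight decay). The step size, momentum coefficients, step count, and starting point are illustrative assumptions and not values taken from the paper, and `run_lion` and `grad_f` are hypothetical names.

```python
import numpy as np

def grad_f(x):
    # Gradient of the toy objective f(x) = (x1 - 1.5)^2 + x2^2 from Figure 1.
    return np.array([2.0 * (x[0] - 1.5), 2.0 * x[1]])

def run_lion(x0, lam, lr=0.01, beta1=0.9, beta2=0.99, steps=5000):
    # Assumed hyperparameters; lam is the weight decay coefficient lambda.
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x)
        c = beta1 * m + (1.0 - beta1) * g    # interpolated momentum used for the sign
        x = x - lr * (np.sign(c) + lam * x)  # sign step plus decoupled weight decay
        m = beta2 * m + (1.0 - beta2) * g    # momentum update
    return x

# Start outside the box for lambda = 1.5 (radius 2/3) but inside it for lambda = 0.5 (radius 2).
x_tight = run_lion([-1.5, 1.0], lam=1.5)
x_loose = run_lion([-1.5, 1.0], lam=0.5)

print("lambda = 1.5:", x_tight, "max |x_i| =", np.max(np.abs(x_tight)))
print("lambda = 0.5:", x_loose, "max |x_i| =", np.max(np.abs(x_loose)))
```

With $\lambda = 1.5$ the unconstrained minimum $(1.5, 0)$ lies outside the box, so the first coordinate should settle near the boundary value $1/\lambda$. With $\lambda = 0.5$ the minimum is inside the box and the iterate should hover near it, with small oscillations on the order of the step size because the sign update has a fixed magnitude.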

Theorems & Definitions (37)

  • Lemma 2.1
  • Example 2.2
  • Theorem 3.1
  • proof: Proof Sketch
  • Theorem 4.1
  • proof
  • Lemma B.1
  • proof
  • Theorem B.2
  • proof
  • ...and 27 more