Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate

Zhiqi Bu, Shiyun Xu, Jialin Mao

TL;DR

This paper investigates convex-like dynamics in deep learning and proposes a framework to control loss using learning-rate schedules. It derives non-asymptotic and asymptotic bounds that map learning-rate sequences to loss, then extends these ideas to deep learning with an abstract loss form and data-driven fitting. A two-dimensional scaling law is introduced, predicting both final loss and optimal learning rate across horizons and model sizes, and validated across SGD, adaptive optimizers, and multi-modal models with strong predictive accuracy. The work offers a practical, data-driven approach to schedule selection and hyperparameter transfer, though it notes limitations in predicting test loss and calls for deeper understanding of why convex-like behavior emerges in DL.

Abstract

Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, and hyperparameters. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning in order to precisely control the loss dynamics via learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and that the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80X across training horizons and 70X across model sizes.
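
To illustrate how an upper bound can map a learning-rate sequence to a loss value and suggest a learning-rate scaling, here is a minimal sketch. It does not use the paper's last-iterate bound, which is not reproduced in this summary. As a stand-in it uses the classical textbook bound for SGD on a convex, $G$-Lipschitz objective, stated for the learning-rate-weighted average iterate: $\mathbb{E}[f(\bar{x}_T)] - f^* \le \frac{D^2 + G^2 \sum_t \eta_t^2}{2 \sum_t \eta_t}$ with $D = \|x_0 - x^*\|$. The function names and the default values $D = G = 1$ are assumptions chosen for illustration.

```python
import math


def averaged_iterate_bound(lrs, D=1.0, G=1.0):
    """Classical convex, G-Lipschitz SGD bound on the weighted average iterate.

    Returns (D^2 + G^2 * sum(eta_t^2)) / (2 * sum(eta_t)).
    This is a stand-in for illustration, NOT the paper's last-iterate bound.
    """
    sum_lr = sum(lrs)
    sum_lr_sq = sum(eta * eta for eta in lrs)
    return (D * D + G * G * sum_lr_sq) / (2.0 * sum_lr)


def constant_schedule(peak, T):
    """Learning rate fixed at `peak` for T steps."""
    return [peak] * T


def linear_decay_schedule(peak, T):
    """Learning rate decaying linearly from `peak` at step 0 toward 0 at step T."""
    return [peak * (1.0 - t / T) for t in range(T)]


def best_constant_lr(T, D=1.0, G=1.0):
    """Minimizer of the bound above over constant schedules: D / (G * sqrt(T))."""
    return D / (G * math.sqrt(T))


if __name__ == "__main__":
    for T in (100, 10_000):
        peak = best_constant_lr(T)
        const_bound = averaged_iterate_bound(constant_schedule(peak, T))
        decay_bound = averaged_iterate_bound(linear_decay_schedule(peak, T))
        print(f"T={T} peak={peak:.4f} constant={const_bound:.4f} linear={decay_bound:.4f}")
```

Under this stand-in bound, the best constant learning rate shrinks like $1/\sqrt{T}$ and the bound at that value equals $DG/\sqrt{T}$. This is the kind of horizon-dependent scaling that an upper bound can suggest; the paper's own law is built on its last-iterate bound and on data-driven fitting.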

Paper Structure (46 sections, 3 theorems, 42 equations, 13 figures, 7 tables)

This paper contains 46 sections, 3 theorems, 42 equations, 13 figures, 7 tables.

Key Result

Corollary 2.4

Consider a learning rate schedule and a peak learning rate $\eta_\textup{peak}$ that satisfy the peak condition (eq:peak), and consider the loss for SGD under convexity (def:convex) and Lipschitz continuity.

Figures (13)

  • Figure 1: Upper bound of SGD loss in \ref{eq:last lr array} with peak learning rate $\eta_\textup{peak}=1/\sqrt{T}$. Left-most is a linearly decaying schedule. Center is a constant schedule.
  • Figure 2: Sequence-to-sequence prediction by \ref{eq:last lr array DL} for ResNet18 on ImageNet with SGD.
  • Figure 3: Sequence-to-sequence prediction by \ref{eq:last lr array DL} for ResNet18 on ImageNet with AdamW.
  • Figure 4: Sequence-to-sequence prediction by \ref{eq:last lr array DL} for GPT2 on OpenWebText with AdamW.
  • Figure 5: Sequence-to-sequence prediction by \ref{eq:last lr array DL} for GPT2 on OpenWebText with Muon-NSGD.
  • ...and 8 more figures

Theorems & Definitions (7)

  • Remark 2.2
  • Example 2.3
  • Corollary 2.4
  • Theorem 1
  • Proof of \ref{col:last-iter}
  • Theorem 2
  • Proof of \ref{thm1}