Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

Steffen Dereich; Arnulf Jentzen; Adrian Riekert

TL;DR

The paper addresses the known problem that SGD and common accelerated or adaptive variants such as Adam fail to converge when the learning rates do not decay to zero, for example when they are held constant. It introduces a learning-rate-adaptive framework, including a variant of Adam, that adjusts the learning rate based on empirical estimates of the objective function and incorporates a practical learning rate grid search. For a simple class of quadratic minimization problems, the authors rigorously prove convergence of a learning-rate-adaptive SGD method; the proof combines an analysis of the invariant measures of SGD with a more general convergence theory for SGD with random but predictable learning rates. Numerically, the adaptive variant of Adam reduces the objective faster than Adam with the default learning rate on a supervised learning problem, deep Kolmogorov methods (DKM), physics-informed neural networks (PINNs), and the deep Ritz method. Overall, the work links theoretical convergence guarantees with practical gains in deep learning-based PDE solvers.
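The non-convergence under constant learning rates can be illustrated on a one-dimensional toy problem. The following Python sketch is not taken from the paper: it assumes the quadratic objective f(theta) = E[(theta - X)^2] / 2 with X uniform on [0, 1], and the function name `sgd_quadratic` and all parameter values are hypothetical choices made for illustration.

```python
import random


def sgd_quadratic(learning_rate, steps, seed=0, a=0.0, b=1.0):
    """Run SGD on f(theta) = E[(theta - X)^2] / 2 with X ~ Uniform[a, b].

    `learning_rate` maps the step index n = 1, 2, ... to a step size.
    Returns the mean squared distance to the minimizer (a + b) / 2,
    averaged over the second half of the run.
    """
    rng = random.Random(seed)
    minimizer = (a + b) / 2
    theta = b  # start away from the minimizer
    tail = []
    for n in range(1, steps + 1):
        x = rng.uniform(a, b)
        # Stochastic gradient of (theta - x)^2 / 2 with respect to theta.
        theta -= learning_rate(n) * (theta - x)
        if n > steps // 2:
            tail.append((theta - minimizer) ** 2)
    return sum(tail) / len(tail)


# Constant learning rate: the iterates keep fluctuating around the minimizer.
constant_error = sgd_quadratic(lambda n: 0.1, steps=20000)
# Learning rate 1/n: the iterate equals the running sample mean and settles.
decaying_error = sgd_quadratic(lambda n: 1.0 / n, steps=20000)

print(f"constant learning rate, tail error: {constant_error:.2e}")
print(f"decaying learning rate, tail error: {decaying_error:.2e}")
```

In this toy setting the constant-rate recursion theta_{n+1} = (1 - gamma) theta_n + gamma X_n has stationary variance gamma Var(X) / (2 - gamma), so its error plateaus at a level proportional to the learning rate instead of vanishing.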

Abstract

It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer, fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant learning rates. The default learning rate schedules for SGD optimization methods in machine learning implementation frameworks such as TensorFlow and PyTorch are constant learning rates. In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates for the values of the objective function of the considered optimization problem (the function that one intends to minimize). In particular, we propose a learning-rate-adaptive variant of the Adam optimizer and implement it in the case of several neural network learning problems, in particular in the context of deep learning approximation methods for partial differential equations such as deep Kolmogorov methods, physics-informed neural networks, and deep Ritz methods. In each of the presented learning problems the proposed learning-rate-adaptive variant of the Adam optimizer reduces the value of the objective function faster than the Adam optimizer with the default learning rate. For a simple class of quadratic minimization problems we also rigorously prove that a learning-rate-adaptive variant of the SGD optimization method converges to the minimizer of the considered minimization problem. Our convergence proof is based on an analysis of the laws of invariant measures of the SGD method as well as on a more general convergence analysis for SGD with random but predictable learning rates which we develop in this work.
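The idea of adjusting the learning rate from empirical estimates of the objective can be sketched on the same kind of quadratic toy problem. This is a simplified illustration under stated assumptions, not the authors' algorithm: the rule used here (shrink the learning rate whenever a fresh Monte Carlo estimate of the objective fails to improve on the best estimate so far) and every name and default value (`adaptive_lr_sgd`, `shrink`, `check_every`, `eval_batch`) are hypothetical.

```python
import random


def empirical_objective(theta, samples):
    """Monte Carlo estimate of f(theta) = E[(theta - X)^2] / 2."""
    return sum((theta - x) ** 2 for x in samples) / (2 * len(samples))


def adaptive_lr_sgd(steps=20000, lr=0.5, shrink=0.5,
                    check_every=500, eval_batch=2000, seed=0):
    """SGD with X ~ Uniform[0, 1] and an objective-driven learning rate.

    Assumed rule (illustrative only): every `check_every` steps, estimate
    the objective on a fresh batch and multiply the learning rate by
    `shrink` if the estimate did not improve on the best value seen so far.
    """
    rng = random.Random(seed)
    theta = 1.0  # start away from the minimizer 0.5
    best_estimate = float("inf")
    for n in range(1, steps + 1):
        x = rng.uniform(0.0, 1.0)
        theta -= lr * (theta - x)  # plain SGD step
        if n % check_every == 0:
            batch = [rng.uniform(0.0, 1.0) for _ in range(eval_batch)]
            estimate = empirical_objective(theta, batch)
            if estimate >= best_estimate:
                lr *= shrink  # no measurable progress: reduce the step size
            else:
                best_estimate = estimate
    return theta, lr


theta, final_lr = adaptive_lr_sgd()
print(f"final theta: {theta:.4f}, final learning rate: {final_lr:.2e}")
```

Because the learning rate depends only on information available before each step, it is a random but predictable quantity, which is the setting the convergence analysis in the abstract refers to.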

Paper Structure (30 sections, 34 theorems, 355 equations, 8 figures)


Introduction
Algorithmic descriptions
Learning-rate-adaptive plain-vanilla SGD methods
Learning-rate-adaptive general SGD methods
Numerical experiments
Supervised learning problem
Deep Kolmogorov method (DKM)
DKM for a heat PDE
DKM for a Black-Scholes PDE
DKM for a stochastic Lorenz equation
Physics-informed neural networks (PINNs)
PINNs for a sine-Gordon type PDE
PINNs for an Allen-Cahn PDE
Deep Ritz method for a Poisson equation
Convergence analysis of SGD with random learning rates
...and 15 more sections

Key Result

Theorem 1.1

Let $d \in \mathbb{N}$, $a \in \mathbb{R}$, $b \in (a,\infty)$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_{n,m} \colon \Omega \to [a,b]^{d}$, $(n,m) \in \mathbb{Z}^2$, be i.i.d. random variables, assume [...] let $M, \mathbf{M} \in \mathbb{N}$, let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^{\dots}$ [...]

Figures (8)

Figure 1: Numerical results for the supervised learning problem from \ref{subsec:simple_supervised} with the target function in \ref{eq:simple_supervised}.
Figure 2: Numerical results for the heat PDE in \ref{eq:kolmogorov_heat} using the DKM.
Figure 3: Numerical results for the Black-Scholes PDE in \ref{eq:kolmogorov_black_scholes} using the DKM.
Figure 4: Numerical results for the stochastic Lorenz equation in \ref{eq:kolmogorov_lorenz} using the DKM.
Figure 5: Numerical results for the sine-Gordon type PDE in \ref{eq:pinn_sine} using PINNs.
...and 3 more figures

Theorems & Definitions (38)

Theorem 1.1: Convergence of SGD with adaptive learning rates
Remark 2.2
Lemma 4.1: Bounds for the linear implicit Euler approximation
Corollary 4.2: Bounds for the linear implicit Euler approximation
Lemma 4.3: Monotone convergence in probability
Lemma 4.4: Almost sure convergence implies convergence in probability
Lemma 4.5: Characterization of convergence in probability
Proposition 4.6: Convergence in probability
Lemma 4.7: Augmentation of measurable spaces
Corollary 4.8: Completion of probability spaces
...and 28 more
