Corner Gradient Descent

Dmitry Yarotsky

TL;DR

It is shown that rates up to $O(t^{-2\zeta})$ can be achieved by a generalized stationary SGD with infinite memory, and it is proved that the optimal rate is given by $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$, where $\nu,\zeta$ are the exponents appearing in the capacity and source spectral conditions.

Abstract

We consider SGD-type optimization on infinite-dimensional quadratic problems with power law spectral conditions. It is well known that on such problems deterministic GD has loss convergence rates $L_t=O(t^{-\zeta})$, which can be improved to $L_t=O(t^{-2\zeta})$ by using Heavy Ball with a non-stationary Jacobi-based schedule (and the latter rate is optimal among fixed schedules). However, in the mini-batch Stochastic GD setting, the sampling noise causes the Jacobi HB to diverge; accordingly no $O(t^{-2\zeta})$ algorithm is known. In this paper we show that rates up to $O(t^{-2\zeta})$ can be achieved by a generalized stationary SGD with infinite memory. We start by identifying generalized (S)GD algorithms with contours in the complex plane. We then show that contours that have a corner with external angle $\theta\pi$ accelerate the plain GD rate $O(t^{-\zeta})$ to $O(t^{-\theta\zeta})$. For deterministic GD, increasing $\theta$ makes it possible to achieve rates arbitrarily close to $O(t^{-2\zeta})$. However, in Stochastic GD, increasing $\theta$ also amplifies the sampling noise, so in general $\theta$ needs to be optimized by balancing the acceleration and noise effects. We prove that the optimal rate is given by $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$, where $\nu,\zeta$ are the exponents appearing in the capacity and source spectral conditions. Furthermore, using fast rational approximations of the power functions, we show that ideal corner algorithms can be efficiently approximated by finite-memory algorithms, and demonstrate their practical efficiency on a synthetic problem and MNIST.
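To make the formula for $\theta_{\max}$ concrete, the following minimal sketch evaluates $\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$ and the resulting rate exponent $\theta_{\max}\zeta$ in $O(t^{-\theta\zeta})$. The function names `theta_max` and `accelerated_exponent` are illustrative and not taken from the paper. The positivity check on $\nu,\zeta$ is an assumption, and the code does not verify that the pair lies in the regime where the theorem applies.

```python
def theta_max(nu: float, zeta: float) -> float:
    """Evaluate min(2, nu, 2 / (zeta + 1/nu)) as stated in the abstract.

    Assumption: nu and zeta are positive exponents. This function only
    evaluates the formula; it does not check the regime of validity.
    """
    if nu <= 0 or zeta <= 0:
        raise ValueError("nu and zeta are assumed to be positive")
    return min(2.0, nu, 2.0 / (zeta + 1.0 / nu))


def accelerated_exponent(nu: float, zeta: float) -> float:
    """Exponent theta_max * zeta of the accelerated rate O(t^{-theta * zeta})."""
    return theta_max(nu, zeta) * zeta


if __name__ == "__main__":
    # Hypothetical exponent pairs chosen only to exercise each branch of the min.
    for nu, zeta in [(2.0, 0.5), (1.5, 0.5), (4.0, 1.0)]:
        print(nu, zeta, theta_max(nu, zeta), accelerated_exponent(nu, zeta))
```

In these sample pairs each of the three terms of the minimum is the binding one in turn: the cap $2$, the capacity exponent $\nu$, and the mixed term $\tfrac{2}{\zeta+1/\nu}$.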

Paper Structure

This paper contains 42 sections, 13 theorems, 186 equations, 6 figures.

Key Result

Theorem 1

Let numbers $L_t$ be given by expansion \eqref{eq:lexpstatio} with some $U_t\ge 0,V_t\ge 0.$ Let $U_{\Sigma}=\sum_{t=1}^\infty U_t$ and $V_{\Sigma}=\sum_{t=1}^\infty V_t.$

Figures (6)

  • Figure 1: Left: The phase diagram of stationary finite-memory SGD from velikanovview, yarotsky2024sgd. Right: Maximum acceleration factor $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$ for Corner SGD in the signal-dominated regime (see Theorem \ref{th:thetamax}).
  • Figure 2: Left: The map $\Psi=\tfrac{P}{Q}$ for Heavy Ball with $P(\mu)=(\mu-1)(\mu-0.4)$ and $Q(\mu)=-\mu$. The contour $\gamma=\Psi(\{\mu:|\mu|=1\})$ encircles $\operatorname{spec}(\mathbf H)$. The map $\Psi$ bijectively maps $\{\mu\in\mathbb C:|\mu|> 1\}$ to the exterior open domain $\mathcal{D}_\gamma$ with boundary $\gamma$. (See Section \ref{sec:mem1contours} for a general discussion of memory-1 contours.) Right: Contour $\gamma$ corresponding to a corner map $\Psi$ with angle $\theta\pi$.
  • Figure 3: Training loss and final predictions of the linear model \ref{eq:model1d} trained to fit the target $y(x)=\mathbf 1_{[1/4,3/4]}(x)$ using either plain or corner SGD with batch size $|B|=100$. The loss trajectories oscillate strongly, so their smoothed versions are also shown and used to estimate the exponents $\zeta$ in power laws $L_t\propto t^{-\zeta}$. The corner SGD has $\theta=1.8$.
  • Figure 4: Training loss of neural network \ref{eq:modelmnist} on MNIST classification with $H=1000$, with batch size $|B|=1000$ (left) or $100$ (right). The full color curves show the smoothed losses.
  • Figure 5: Contours $\gamma=\Psi(\{\mu:|\mu|=1\})$ corresponding to different memory-1 maps $\Psi$ (see Section \ref{sec:mem1contours}). Left: plain Gradient Descent (a circle). Center: Heavy Ball (an ellipse; $\beta=0.5$). Right: general memory-1 algorithms (a Zhukovsky airfoil; $\beta=0.65, q_0=0.125, q_1=-1$).
  • ...and 1 more figures
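
The Heavy Ball contour described in the Figure 2 caption can be traced numerically. The sketch below evaluates $\Psi=P/Q$ with $P(\mu)=(\mu-1)(\mu-0.4)$ and $Q(\mu)=-\mu$ on the unit circle and checks that the image is an ellipse, in line with the Figure 5 caption. The centre $1.4$ and semi-axes $1.4$ and $0.6$ are not stated in the source; they follow from expanding $\Psi(e^{i\varphi})=1.4-1.4\cos\varphi-0.6\,i\sin\varphi$. The helper names `psi`, `contour` and `ellipse_residual` are illustrative.

```python
import cmath
import math


def psi(mu: complex) -> complex:
    """Map Psi = P / Q with P(mu) = (mu - 1)(mu - 0.4), Q(mu) = -mu (Figure 2 caption)."""
    return (mu - 1) * (mu - 0.4) / (-mu)


def contour(n_points: int = 360) -> list:
    """Sample gamma = Psi({|mu| = 1}) at n_points equally spaced angles."""
    return [psi(cmath.exp(2j * math.pi * k / n_points)) for k in range(n_points)]


def ellipse_residual(z: complex) -> float:
    """Deviation of z from the ellipse ((x - 1.4) / 1.4)^2 + (y / 0.6)^2 = 1.

    The centre and semi-axes are derived here from the expansion of Psi on
    the unit circle; they are not quoted from the paper.
    """
    return abs(((z.real - 1.4) / 1.4) ** 2 + (z.imag / 0.6) ** 2 - 1.0)


if __name__ == "__main__":
    gamma = contour()
    print("max ellipse residual:", max(ellipse_residual(z) for z in gamma))
    print("real extent:", min(z.real for z in gamma), max(z.real for z in gamma))
```

The contour passes through the origin at $\mu=1$ and reaches its rightmost point $2.8$ at $\mu=-1$.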

Theorems & Definitions (20)

  • Theorem 1: yarotsky2024sgd
  • Theorem 2
  • Theorem 3: \ref{sec:proofmain}
  • Theorem 4
  • Proposition 1: \ref{sec:proofpsitemplate}
  • Proposition 2: \ref{sec:proofdiscrcorner}
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 10 more