Acceleration by Random Stepsizes: Hedging, Equalization, and the Arcsine Stepsize Schedule

Jason M. Altschuler, Pablo A. Parrilo

TL;DR

The paper shows that Gradient Descent can be fully accelerated for separable convex functions by using i.i.d. inverse stepsizes drawn from the Arcsine distribution over $(m,M)$. This random-stepsize strategy achieves the optimal accelerated rate $R_{acc}=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, yielding an iteration complexity of $O(\sqrt{\kappa}\log(1/\varepsilon))$ without momentum. The analysis connects to logarithmic potential theory via an equalization property: the expected progress per iteration is constant across curvatures, and a martingale argument extends the result from quadratics to the broader class of separable convex functions. The results imply potential benefits of randomized stepsizes, including parallelization opportunities, while also addressing stability under inexact gradients and offering a game-theoretic perspective on lower bounds. These findings raise intriguing questions about derandomization, applicability beyond separability, and extensions to other spectral structures.
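The step from the rate $R_{acc}$ to the iteration count can be made explicit. The following is a short worked derivation, assuming the distance to the minimizer contracts like $R_{acc}^n$ after $n$ iterations; the symbols $n$ (iteration count) and $\varepsilon$ (target relative accuracy) are used in the usual sense.

$$
R_{acc}^{\,n} \le \varepsilon
\iff
n \ge \frac{\log(1/\varepsilon)}{\log(1/R_{acc})},
\qquad
\log\frac{1}{R_{acc}}
= \log\frac{1+\kappa^{-1/2}}{1-\kappa^{-1/2}}
= 2\,\operatorname{artanh}\!\left(\kappa^{-1/2}\right)
\ge \frac{2}{\sqrt{\kappa}} .
$$

Hence $n = \left\lceil \tfrac{\sqrt{\kappa}}{2}\log(1/\varepsilon) \right\rceil$ iterations suffice, which is the $O(\sqrt{\kappa}\log(1/\varepsilon))$ bound. For comparison, the standard unaccelerated rate $\frac{\kappa-1}{\kappa+1}$ of constant-stepsize Gradient Descent gives $\log\frac{\kappa+1}{\kappa-1} = 2\,\operatorname{artanh}(\kappa^{-1}) \approx \frac{2}{\kappa}$ for large $\kappa$, and so needs on the order of $\kappa\log(1/\varepsilon)$ iterations.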

Abstract

We show that for separable convex optimization, random stepsizes fully accelerate Gradient Descent. Specifically, using inverse stepsizes i.i.d. from the Arcsine distribution improves the iteration complexity from $O(\kappa)$ to $O(\kappa^{1/2})$, where $\kappa$ is the condition number. No momentum or other algorithmic modifications are required. This result is incomparable to the (deterministic) Silver Stepsize Schedule which does not require separability but only achieves partial acceleration $O(\kappa^{\log_{1+\sqrt{2}} 2}) \approx O(\kappa^{0.78})$. Our starting point is a conceptual connection to potential theory: the variational characterization for the distribution of stepsizes with fastest convergence rate mirrors the variational characterization for the distribution of charged particles with minimal logarithmic potential energy. The Arcsine distribution solves both variational characterizations due to a remarkable "equalization property" which in the physical context amounts to a constant potential over space, and in the optimization context amounts to an identical convergence rate over all quadratic functions. A key technical insight is that martingale arguments extend this phenomenon to all separable convex functions. We interpret this equalization as an extreme form of hedging: by using this random distribution over stepsizes, Gradient Descent converges at exactly the same rate for all functions in the function class.
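The algorithm described in the abstract is plain Gradient Descent with a random stepsize at each iteration. A minimal sketch follows, not the authors' code: the function names, the seed, and the three-coordinate quadratic test problem are illustrative assumptions. The sampler relies on the standard fact that $\cos\theta$ is Arcsine-distributed on $(-1,1)$ when $\theta$ is uniform on $(0,\pi)$.

```python
import math
import random


def sample_inverse_stepsize(m, M, rng):
    """Draw one inverse stepsize from the Arcsine distribution on (m, M)."""
    # If theta is uniform on (0, pi), cos(theta) is Arcsine-distributed on (-1, 1);
    # shift and scale it to the interval (m, M).
    theta = rng.uniform(0.0, math.pi)
    return 0.5 * (M + m) + 0.5 * (M - m) * math.cos(theta)


def gd_random_stepsizes(grad, x0, m, M, n_steps, seed=0):
    """Plain GD, x <- x - alpha_t * grad(x), with i.i.d. 1/alpha_t ~ Arcsine(m, M)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        alpha = 1.0 / sample_inverse_stepsize(m, M, rng)
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical test problem (an assumption for illustration): the separable
# quadratic f(x) = sum_i c_i * x_i^2 / 2 with curvatures c_i in [m, M].
# Its minimizer is x* = 0.
m, M = 0.1, 1.0
curvatures = [0.1, 0.4, 1.0]


def grad(x):
    return [c * xi for c, xi in zip(curvatures, x)]


x0 = [1.0, 1.0, 1.0]
x_final = gd_random_stepsizes(grad, x0, m, M, n_steps=500)
dist0 = math.sqrt(sum(v * v for v in x0))
dist_final = math.sqrt(sum(v * v for v in x_final))
print("relative distance to minimizer after 500 steps:", dist_final / dist0)
```

Individual steps may increase the distance to the minimizer, since some sampled stepsizes exceed $2/M$; the sketch only illustrates that the long-run trend is contraction. No momentum term or other state beyond the iterate is kept.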

Paper Structure

This paper contains 36 sections, 22 theorems, 94 equations, 3 figures, 1 table.

Key Result

Theorem 1.2

Consider any dimension $d$, any separable function $f : \mathbb{R}^d \to \mathbb{R}$ that is $m$-strongly convex and $M$-smooth, and any initialization point $x_0$ that is not equal to the minimizer $x^*$ of $f$. By using i.i.d. inverse stepsizes $\alpha_t^{-1}$ from the Arcsine distribution on $(m,M)$, Gradient Descent converges at the accelerated rate $R_{acc}$, where this convergence is in the almost sure and $L^1$ sense. Moreover, this is the unique distributio...

Figures (3)

  • Figure 1: The induced distribution of stepsizes $\alpha$, for inverse stepsizes $\alpha^{-1}$ taken from the Arcsine distribution on $(m,M)$. The minimum stepsize is $1/M$, the maximum is $1/m$, and the median is $2/(M+m)$, which is the optimal value for constant stepsize schedules. For constant stepsize schedules, the dashed red line $\bar{\alpha} = 2/M$ is the threshold for convergence; larger stepsizes $\alpha$ lead to divergence, and short stepsizes $\alpha$ lead to slow convergence. This distribution optimally hedges between short and long steps. Note that the mean stepsize is $1/\sqrt{Mm}$ which is larger than the divergence threshold for constant stepsize schedules, by a factor of $\Theta(\sqrt{\kappa})$. This plot sets $\kappa = 10$, $m=1/\kappa$, $M=1$; the discrepancy between this distribution and standard constant stepsizes is even more dramatic for larger $\kappa$.
  • Figure 2: As $n \to \infty$, the empirical distribution (yellow) of the $n$-step inverse Chebyshev stepsize schedule \ref{eq:quad-cheb} converges to the Arcsine distribution (blue). The fit is increasingly accurate as $n$ increases from $10^2$ (left) to $10^4$ (right). Demonstrated here for $m=1$ and $M=10$.
  • Figure 3: The randomness of the proposed stepsizes leads to variability between runs. Illustrated here with a boxplot for $500$ runs ($7$ full trajectories shown) from the same initialization on a univariate quadratic function, with $\kappa = 200$; similar phenomena occur for other problem instances. The horizontal axis is the number of iterations $n$; the vertical axis is $\log_{10} \tfrac{\|x_n - x^*\|}{\|x_0 - x^*\|}$. The worst run can diverge exponentially, but this occurs with exponentially low probability (details in §\ref{ssec:discussion:notions}). The median run converges exponentially at the optimal accelerated rate $R_{acc}$ (Theorem 1.2). The fit is extremely precise, as shown by the bolded blue line. The best run can be substantially better than the median run, leading to the possibility of faster parallel optimization (details in §\ref{ssec:discussion:parallel}).
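The convergence described in the Figure 2 caption can be checked numerically. The sketch below assumes the inverse Chebyshev stepsizes are the roots of the degree-$n$ Chebyshev polynomial shifted to $(m,M)$, which is the standard form; the paper's own equation is not reproduced in this summary, so the node formula and all function names here are assumptions. It measures the largest gap between the empirical CDF of the nodes and the Arcsine CDF.

```python
import math


def chebyshev_inverse_stepsizes(m, M, n):
    """Roots of the degree-n Chebyshev polynomial, shifted from (-1, 1) to (m, M)."""
    return [
        0.5 * (M + m) + 0.5 * (M - m) * math.cos((2 * i + 1) * math.pi / (2 * n))
        for i in range(n)
    ]


def arcsine_cdf(x, m, M):
    """CDF of the Arcsine distribution on (m, M)."""
    u = min(1.0, max(0.0, (x - m) / (M - m)))  # clamp against rounding error
    return (2.0 / math.pi) * math.asin(math.sqrt(u))


def ks_distance(points, m, M):
    """Largest gap between the empirical CDF of `points` and the Arcsine CDF."""
    pts = sorted(points)
    n = len(pts)
    gap = 0.0
    for j, x in enumerate(pts):
        cdf = arcsine_cdf(x, m, M)
        # The empirical CDF jumps from j/n to (j+1)/n at the j-th sorted point.
        gap = max(gap, abs(cdf - j / n), abs(cdf - (j + 1) / n))
    return gap


m, M = 1.0, 10.0  # same values as in the Figure 2 caption
for n in (10**2, 10**3, 10**4):
    print(n, ks_distance(chebyshev_inverse_stepsizes(m, M, n), m, M))
```

Under this node formula the $j$-th sorted node sits at the $(j+\tfrac12)/n$ quantile of the Arcsine distribution, so the gap shrinks like $1/(2n)$, consistent with the increasingly accurate fit described in the caption.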

Theorems & Definitions (45)

  • Definition 1.1: Separable functions
  • Theorem 1.2: Random stepsizes accelerate GD for separable convex optimization
  • Lemma 2.1: Extremal property of the Arcsine distribution
  • Lemma 2.2: Equalization property of the Arcsine distribution
  • Remark 2.3: Overcoming unpredictability via random hedging
  • Lemma 3.1
  • Proof
  • Proof of Theorem 1.2 for univariate $f$
  • Lemma 3.2: Martingale helper lemma
  • Lemma 3.3: Kronecker's Lemma
  • ...and 35 more
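The equalization property named in Lemma 2.2 can be probed with a quick Monte Carlo experiment. For a quadratic with curvature $\lambda \in [m,M]$, one GD step with inverse stepsize $\beta$ multiplies the error by $|1 - \lambda/\beta|$. The sketch below estimates $\mathbb{E}\log|1 - \lambda/\beta|$ for $\beta$ drawn from the Arcsine distribution on $(m,M)$ and compares it with $\log R_{acc}$. Reading the lemma as a statement about this expected log-contraction is an assumption based on the TL;DR, and the function names, seed, and sample size are illustrative.

```python
import math
import random


def r_acc(m, M):
    """Accelerated rate (sqrt(kappa) - 1) / (sqrt(kappa) + 1) with kappa = M / m."""
    s = math.sqrt(M / m)
    return (s - 1.0) / (s + 1.0)


def mean_log_contraction(lam, m, M, n_samples, seed=0):
    """Monte Carlo estimate of E[log|1 - lam / beta|] for beta ~ Arcsine(m, M)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.uniform(0.0, math.pi)
        beta = 0.5 * (M + m) + 0.5 * (M - m) * math.cos(theta)
        # Guard against log(0) if beta happens to equal lam exactly in floating point.
        total += math.log(max(abs(1.0 - lam / beta), 1e-300))
    return total / n_samples


m, M = 0.1, 1.0
print("log R_acc:", math.log(r_acc(m, M)))
for lam in (0.1, 0.3, 0.55, 1.0):
    print("lambda =", lam, "estimate:", mean_log_contraction(lam, m, M, 200_000))
```

The estimates should agree across all curvatures up to sampling noise, which is the sense in which the random stepsizes hedge: no curvature in $[m,M]$ is favored over another.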