Gradient Descent for Convex and Smooth Noisy Optimization

Feifei Hu, Mathieu Gerber

TL;DR

The paper addresses noisy convex optimization where the objective $F(θ)=\mathbb{E}[f(θ,Z)]$ is strictly convex but not necessarily $L$-smooth. It studies gradient descent with backtracking line search (GD-BLS) as a robust optimizer under weak smoothness and finite-moment assumptions, and shows that a single-stage sample average approximation built on a sample of size $n(B)\sim B^{1/2}$ yields a rate of $\mathcal{O}_{\mathbb{P}}(B^{-1/4})$, where $B$ is the computational budget. To accelerate learning, it proposes a retrospective multi-stage refinement ($J$ stages) that stops each stage early and reallocates the remaining budget to progressively finer approximations of $F$, achieving a rate of $\mathcal{O}_{\mathbb{P}}(B^{-\frac{1}{2}(1-δ^{J})})$ for a user-specified $δ\in(1/2,1)$, and a generalized rate of $\mathcal{O}_{\mathbb{P}}(B^{-\frac{α}{1+α}(1-δ^{J})})$ with $δ\in(2α/(1+3α),1)$ when only a moment of order $1+α$, $α\in(0,1]$, is finite. Beyond knowing $α$, the results do not require tuning the algorithms' parameters to the specific $F$ or $f$, and they are illustrated by a Poisson-based example where stochastic gradient (SG) is not guaranteed to converge but the GD-BLS-based algorithms are. Overall, the work provides a budget-aware framework for noisy optimization beyond the standard $L$-smooth regime, whose rate exponent approaches $α/(1+α)$ (that is, $1/2$ when $α=1$) as $J$ grows, with implications for large-scale statistical learning where smoothness and variance conditions may be violated.
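The single-stage idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the toy loss `f(theta, z) = exp(theta) - z * theta` with `Z ~ Poisson(1)` is a hypothetical stand-in (it is not the example of Proposition 1), the Armijo constants `step0`, `beta` and `c` are arbitrary choices, and the budget accounting (one unit per iteration over the sample, ignoring line-search evaluations) is deliberately crude. The names `gd_bls` and `poisson_sample` are introduced here for illustration only.

```python
import math
import random


def gd_bls(obj, grad, theta0, max_iter, tol=1e-6, step0=1.0, beta=0.5, c=1e-4):
    """1-D gradient descent with Armijo backtracking line search."""
    theta = theta0
    for _ in range(max_iter):
        g = grad(theta)
        if abs(g) <= tol:  # stop once the gradient is close to zero
            break
        step = step0
        # Shrink the step until the sufficient-decrease (Armijo) condition holds.
        while obj(theta - step * g) > obj(theta) - c * step * g * g:
            step *= beta
        theta -= step * g
    return theta


def poisson_sample(rng, lam=1.0):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


# Hypothetical toy problem: F(theta) = E[exp(theta) - Z * theta] = exp(theta) - theta,
# which is strictly convex, not globally L-smooth, and minimized at theta_star = 0.
rng = random.Random(0)
B = 10_000                # computational budget (assumed unit: one pass over the sample)
n = int(B ** 0.5)         # sample size n(B) ~ B^{1/2}
z_bar = sum(poisson_sample(rng) for _ in range(n)) / n


def F_n(theta):
    """Sample average approximation of F built from the n draws."""
    return math.exp(theta) - z_bar * theta


def grad_F_n(theta):
    """Gradient of the sample average approximation."""
    return math.exp(theta) - z_bar


# Crude budget accounting: at most B // n iterations over the sample.
theta_hat = gd_bls(F_n, grad_F_n, theta0=2.0, max_iter=B // n)
print(theta_hat, math.log(z_bar))
```

For this toy loss the minimizer of the sample average is available in closed form, `log(z_bar)`, so the printed pair gives a quick check that the line search reaches it without any step size tuned to the exponential growth of the objective.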

Abstract

We study the use of gradient descent with backtracking line search (GD-BLS) to solve the noisy optimization problem $θ_\star:=\mathrm{argmin}_{θ\in\mathbb{R}^d} \mathbb{E}[f(θ,Z)]$, imposing that the function $F(θ):=\mathbb{E}[f(θ,Z)]$ is strictly convex but not necessarily $L$-smooth. Assuming that $\mathbb{E}[\|\nabla_θf(θ_\star,Z)\|^2]<\infty$, we first prove that sample average approximation based on GD-BLS allows us to estimate $θ_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-0.25})$, where $B$ is the available computational budget. We then show that we can improve upon this rate by stopping the optimization process earlier, when the gradient of the objective function is sufficiently close to zero, and using the residual computational budget to optimize, again with GD-BLS, a finer approximation of $F$. By iteratively applying this strategy $J$ times, we establish that we can estimate $θ_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{1}{2}(1-δ^{J})})$, where $δ\in(1/2,1)$ is a user-specified parameter. More generally, we show that if $\mathbb{E}[\|\nabla_θf(θ_\star,Z)\|^{1+α}]<\infty$ for some known $α\in (0,1]$, then this approach, which can be seen as a retrospective approximation algorithm with a fixed computational budget, allows us to learn $θ_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{α}{1+α}(1-δ^{J})})$, where $δ\in(2α/(1+3α),1)$ is a tuning parameter. Beyond knowing $α$, achieving the aforementioned convergence rates does not require tuning the algorithms' parameters according to the specific functions $F$ and $f$ at hand, and we exhibit a simple noisy optimization problem for which stochastic gradient is not guaranteed to converge while the algorithms discussed in this work are.
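To make the stated rates concrete, the short Python helper below evaluates the exponent $r=\frac{α}{1+α}(1-δ^{J})$ in $\mathcal{O}_{\mathbb{P}}(B^{-r})$ and enforces the admissible range $δ\in(2α/(1+3α),1)$ given in the abstract. The function name `rate_exponent` is hypothetical and only the formula itself comes from the text.

```python
def rate_exponent(alpha: float, delta: float, J: int) -> float:
    """Exponent r such that the error is O_P(B^{-r}) after J stages."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    lower = 2 * alpha / (1 + 3 * alpha)  # lower bound on delta from the abstract
    if not lower < delta < 1:
        raise ValueError(f"delta must lie in ({lower}, 1)")
    if J < 1:
        raise ValueError("J must be a positive integer")
    return alpha / (1 + alpha) * (1 - delta ** J)


# alpha = 1 (finite second moment): the exponent grows with J towards 1/2.
for J in (1, 2, 5, 20):
    print(J, rate_exponent(1.0, 0.51, J))
```

With $α=1$ the lower bound on $δ$ is $1/2$, so a single stage ($J=1$) with $δ$ just above $1/2$ gives an exponent just below $1/4$, in line with the single-stage rate, while larger $J$ pushes the exponent towards $1/2$. With $α=0.5$ the bound is $0.4$ and the limiting exponent is $1/3$.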

Paper Structure

This paper contains 45 sections, 20 theorems, 172 equations, 3 figures, 2 algorithms.

Key Result

Proposition 1

Consider the $d=1$ dimensional optimization problem $θ_\star:=\mathrm{argmin}_{θ\in\mathbb{R}^d} \mathbb{E}[f(θ,Z)]$ where $Z=(X,Y)$, with $X$ and $Y$ two independent Poisson random variables such that $\mathbb{E}[X]=\mathbb{E}[Y]=1$, and where the function $f:\mathbb{R}^d\times\mathsf{Z}\rightarrow\mathbb{R}$ is defined by [...]. Then, the function $F$ is strictly convex, twice continuously differentiable and $\mathrm{Var}(\nabla_\theta f(\theta,Z))<\infty$ [...]

Figures (3)

  • Figure 1: Results for the example of Section \ref{sub:ex1}. The left plot shows $\mathbb{E}[\|\hat{\theta}_{J,B}-\theta_\star\|]$ as a function of $B$, where the solid line is for $\delta=0.95$ and the dashed line for $\delta=0.51$, and with the two dotted lines representing the $B^{-1/2}$ convergence rate. The middle and right plots show the evolution of $\mathbb{E}[J_{B}]$ as a function of $B$ for $\delta=0.95$ (middle plot) and for $\delta=0.51$ (right plot). All the results are obtained from 100 independent realizations of $(Z_i)_{i\geq 1}$.
  • Figure 2: Results for the example of Section \ref{sub:ex2}. The left plot shows $\mathbb{E}[\|\hat{\theta}_{J,B}-\theta_\star\|]$ as a function of $B$, where the solid line is for $\delta=0.95$ and the dashed line for $\delta=0.51$, and with the dotted line representing the $B^{-1/2}$ convergence rate. The middle and right plots show the evolution of $\mathbb{E}[J_{B}]$ as a function of $B$ for $\delta=0.95$ (middle plot) and for $\delta=0.51$ (right plot). All the results are obtained from 100 independent realizations of $(Z_i)_{i\geq 1}$.
  • Figure 3: Results for the example of Section \ref{sub:ex3}. The top plots show $\mathbb{E}[\|\hat{\theta}_{J,B}-\theta_\star\|_{0.1}]$ as a function of $B$ and the bottom plots show $\mathbb{E}[J_{B}]$ as a function of $B$. The left plots are for $(\alpha',\delta)=(0.5,0.41)$, the middle plots for $(\alpha',\delta)=(0.5,0.95)$ and the right plots for $(\alpha',\delta)=(1,0.95)$. In the top plots the dotted lines represent the $B^{-1/3}$ convergence rate, and all the results are obtained from 100 independent realizations of $(Z_i)_{i\geq 1}$.

Theorems & Definitions (38)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Corollary 1
  • Lemma 5
  • Proposition 4
  • ...and 28 more