Accelerated Gradient Methods with Biased Gradient Estimates: Risk Sensitivity, High-Probability Guarantees, and Large Deviation Bounds

Mert Gürbüzbalaban, Yasa Syed, Necdet Serhat Aybat

TL;DR

The paper develops a risk-sensitive framework to quantify robustness of generalized momentum methods to biased gradient errors in smooth strongly convex optimization. It derives exact RSI expressions for quadratic objectives via a dimension-reduced $2\times 2$ Riccati formulation, and proves a large-deviation principle that connects the rate function to RSI and the $H_\infty$-norm, tying tail behavior to worst-case robustness. It extends these results to general strongly convex objectives with biased sub-Gaussian noise, offering finite-time high-probability bounds and non-asymptotic LDPs for ergodic averages, along with Lyapunov-based analysis of stability. Numerically, the work demonstrates Pareto-frontier trade-offs between convergence rate and risk across GD, HB, NAG, and TMM, and validates tail-bound guarantees in robust regression settings, highlighting the practical implications for parameter tuning under noisy gradient information.

Abstract

We study trade-offs between convergence rate and robustness to gradient errors in the context of first-order methods. Our focus is on generalized momentum methods (GMMs)--a broad class that includes Nesterov's accelerated gradient, heavy-ball, and gradient descent methods--for minimizing smooth strongly convex objectives. We allow stochastic gradient errors that may be adversarial and biased, and quantify robustness of these methods to gradient errors via the risk-sensitive index (RSI) from robust control theory. For quadratic objectives with i.i.d. Gaussian noise, we give closed-form expressions for RSI in terms of solutions to $2\times 2$ matrix Riccati equations, revealing a Pareto frontier between RSI and convergence rate over the choice of step-size and momentum parameters. We then prove a large-deviation principle for time-averaged suboptimality in the large iteration limit and show that the rate function is, up to a scaling, the convex conjugate of the RSI function. We further show that the rate function and RSI are linked to the $H_\infty$-norm--a measure of robustness to the worst-case deterministic gradient errors--so that stronger worst-case robustness (smaller $H_\infty$-norm) leads to sharper decay of the tail probabilities for the average suboptimality. Beyond quadratics, under potentially biased sub-Gaussian gradient errors, we derive non-asymptotic bounds on a finite-time analogue of the RSI, yielding finite-time high-probability guarantees and non-asymptotic large-deviation bounds for the averaged iterates. In the case of smooth strongly convex functions, we also observe an analogous trade-off between RSI and convergence-rate bounds. To our knowledge, these are the first non-asymptotic guarantees for GMMs with biased gradients and the first risk-sensitive analysis of GMMs. Finally, we provide numerical experiments on a robust regression problem to illustrate our results.
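To make the setting concrete, the following is a minimal sketch of a generalized momentum iteration on a diagonal quadratic with biased Gaussian gradient errors. The update rule, the function name `gmm_quadratic`, the per-coordinate noise model, and the parameter choices are assumptions made for illustration; they follow the common three-parameter $(\alpha, \beta, \nu)$ form (with $\nu = 0$ for heavy-ball and $\nu = \beta$ for NAG) and are not taken from the text above. It does not compute the RSI; it only records the suboptimality sequence whose time average the abstract refers to.

```python
import random


def gmm_quadratic(eigs, alpha, beta, nu, sigma=0.0, bias=0.0, iters=200, seed=0):
    """Run a generalized momentum method on f(x) = 0.5 * sum_i eigs[i] * x_i**2.

    Assumed update (illustrative, not quoted from the paper):
        y_k     = x_k + nu * (x_k - x_{k-1})
        x_{k+1} = x_k + beta * (x_k - x_{k-1}) - alpha * (grad f(y_k) + e_k)
    where each coordinate of e_k is drawn from N(bias, sigma^2), a
    hypothetical model of a biased stochastic gradient error.

    Returns the list of suboptimality values f(x_k) - f*, with f* = 0.
    """
    rng = random.Random(seed)
    d = len(eigs)
    x_prev = [1.0] * d  # x_{-1}
    x = [1.0] * d       # x_0
    subopt = []
    for _ in range(iters):
        subopt.append(0.5 * sum(lam * xi * xi for lam, xi in zip(eigs, x)))
        # Extrapolated point at which the gradient is queried.
        y = [xi + nu * (xi - xp) for xi, xp in zip(x, x_prev)]
        # Exact gradient of the quadratic plus a biased Gaussian error.
        g = [lam * yi + rng.gauss(bias, sigma) for lam, yi in zip(eigs, y)]
        x_next = [xi + beta * (xi - xp) - alpha * gi
                  for xi, xp, gi in zip(x, x_prev, g)]
        x_prev, x = x, x_next
    return subopt


# Quadratic with eigenvalues mu = 1 and L = 3, matching the figure settings.
mu, L = 1.0, 3.0
eigs = [mu, L]

# Gradient descent: beta = nu = 0, with the classical step size 2 / (mu + L).
gd_clean = gmm_quadratic(eigs, alpha=2.0 / (mu + L), beta=0.0, nu=0.0)
gd_biased = gmm_quadratic(eigs, alpha=2.0 / (mu + L), beta=0.0, nu=0.0, bias=0.3)

# NAG: nu = beta, with alpha = 1 / L and the usual momentum choice.
alpha_nag = 1.0 / L
beta_nag = (1.0 - (alpha_nag * mu) ** 0.5) / (1.0 + (alpha_nag * mu) ** 0.5)
nag_noisy = gmm_quadratic(eigs, alpha_nag, beta_nag, beta_nag, sigma=2.0 ** 0.5)

# Time-averaged suboptimality over the run.
avg_subopt = sum(nag_noisy) / len(nag_noisy)
```

In this sketch, a pure bias with no randomness leaves gradient descent at the shifted fixed point $x_i = -\text{bias}/\lambda_i$, so the suboptimality settles at $\tfrac{1}{2}\,\text{bias}^2 \sum_i 1/\lambda_i$ instead of decaying to zero. This is one simple way to see why biased errors call for a separate robustness analysis.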

Paper Structure

This paper contains 18 sections, 21 theorems, 204 equations, 7 figures, 1 table.

Key Result

Theorem 3.4

Consider a quadratic function $f \in \mathcal{C}_\mu^L(\mathbb{R}^d)$ with a Hessian matrix $Q\in\mathbb{R}^{d\times d}$. Suppose that with the given parameters $(\alpha, \beta, \nu)$, the GMM dynamics exhibit global linear convergence to the unique fixed point of the system $\xi_ where $\mu= \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_d = L$ are the eigenvalues of $Q$ in a non-

Figures (7)

  • Figure 1: Risk sensitivity for common parameterizations of GMM methods; $d = 2, L = 3, \mu = 1, \sigma^2 = 2$.
  • Figure 2: (a) $H_\infty$ as a function of $\alpha$ based on \ref{eq-hinfty-gd-formula} in the same problem setting; (b) The risk-sensitive index as a function of GD step-size $\alpha$ for $\theta \approx \{4/3,~4\}$, where the objective is a quadratic with $d = 2, L = 3, \mu = 1$, and $\sigma^2 = 2$.
  • Figure 3: (a) Logarithm of the risk-sensitive index $R(\theta)$ as a function of the parameters $\alpha,\beta$ of HB with $\theta = 1.2$, $\sigma^2=2$, $\nu=0$. (b) Convergence rate $\rho$ as a function of $\alpha,\beta$. The objective is a quadratic with $d = 2, L = 3, \mu = 1$.
  • Figure 4: (a) Logarithm of the risk-sensitive index $R(\theta)$ as a function of the parameters $\alpha,\beta$ of NAG with $\theta = 3.7$, $\sigma^2=2$, $\nu=\beta$. (b) Convergence rate $\rho$ as a function of $\alpha,\beta$, where $\beta_*(\alpha) = \frac{1 - \sqrt{\alpha\mu}}{1+\sqrt{\alpha\mu}}$ for $\alpha \in [0,\frac{1}{L}]$. The objective is a quadratic with $d = 2, L = 3, \mu = 1$.
  • Figure 5: Pareto boundary for GD, NAG, HB illustrating the trade-off between risk and rate for a quadratic objective with $d = 2, L = 3, \mu = 1, \sigma^2 = 2, \theta = 0.2$.
  • ...and 2 more figures

Theorems & Definitions (47)

  • Remark 3.2
  • Definition 3.3: tran2017qualitative, lin1996h, van2016l2, fleming1995risk, gurbuzbalaban2023robustly
  • Theorem 3.4: gurbuzbalaban2023robustly
  • Theorem 4.1
  • proof
  • Remark 4.2
  • Corollary 4.3
  • proof
  • Proposition 4.4
  • proof
  • ...and 37 more