Instance-optimal stochastic convex optimization: Can we improve upon sample-average and robust stochastic approximation?

Liwei Jiang, Ashwin Pananjady

Abstract

We study the unconstrained minimization of a smooth and strongly convex population loss function under a stochastic oracle that introduces both additive and multiplicative noise; this is a canonical and widely-studied setting that arises across operations research, signal processing, and machine learning. We begin by showing that standard approaches such as sample average approximation and robust (or averaged) stochastic approximation can lead to suboptimal -- and in some cases arbitrarily poor -- performance with realistic finite sample sizes. In contrast, we demonstrate that a carefully designed variance reduction strategy, which we term VISOR for short, can significantly outperform these approaches while using the same sample size. Our upper bounds are complemented by finite-sample, information-theoretic local minimax lower bounds, which highlight fundamental, instance-dependent factors that govern the performance of any estimator. Taken together, these results demonstrate that an accelerated variant of VISOR is instance-optimal, achieving the best possible sample complexity up to logarithmic factors while also attaining optimal oracle complexity. We apply our theory to generalized linear models and improve upon classical results. In particular, we obtain the best-known non-asymptotic, instance-dependent generalization error bounds for stochastic methods, even in linear regression.
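For readers who want a concrete point of reference for the baselines mentioned above, the sketch below shows one standard instantiation of robust (averaged) stochastic approximation: SGD with a constant stepsize followed by Polyak-Ruppert iterate averaging, run against a synthetic quadratic oracle with both additive and multiplicative gradient noise. The quadratic loss, noise model, stepsize, and parameter values here are illustrative assumptions and are not taken from the paper; in particular, this is not the VISOR procedure.

```python
import numpy as np

def noisy_grad(x, x_star, A, sigma_add, sigma_mult, rng):
    """Stochastic gradient oracle for the quadratic loss F(x) = (1/2)(x - x*)' A (x - x*):
    the true gradient A(x - x*) is perturbed by multiplicative and additive noise."""
    g = A @ (x - x_star)
    mult = 1.0 + sigma_mult * rng.standard_normal()   # multiplicative perturbation
    add = sigma_add * rng.standard_normal(x.shape)    # additive perturbation
    return mult * g + add

def averaged_sgd(x0, x_star, A, n, step, sigma_add=1.0, sigma_mult=0.5, seed=0):
    """Robust (averaged) stochastic approximation: constant-stepsize SGD iterates,
    reporting the running (Polyak-Ruppert) average of the iterates."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    avg = np.zeros_like(x0)
    for t in range(1, n + 1):
        x = x - step * noisy_grad(x, x_star, A, sigma_add, sigma_mult, rng)
        avg += (x - avg) / t                          # online average of the iterates
    return avg

if __name__ == "__main__":
    A = np.diag([1.0, 10.0])        # Hessian of the (assumed) quadratic population loss
    x_star = np.array([1.0, 1.0])   # minimizer
    x0 = np.zeros(2)                # initialize at the origin
    for n in (200, 2000, 20000):
        x_hat = averaged_sgd(x0, x_star, A, n, step=0.1)
        print(n, n * np.linalg.norm(x_hat - x_star) ** 2)   # scaled squared error
```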

Paper Structure

This paper contains 43 sections, 34 theorems, 223 equations, 4 figures, and 3 algorithms.

Key Result

Proposition 3.1

Let $F$ be a quadratic function with Hessian matrix $A$ that satisfies the non-quadraticity assumption with parameters $L \ge \mu > 0$ and $L_H = 0$. For any positive semi-definite covariance matrix $\Sigma$ and integer $n \ge 1$, the first bound of the proposition holds. In addition, there is a stochastic first-order method such that, for any $(f,P) \in \mathcal{N}(n,F,\Sigma)$, the second bound holds whenever the variance-reduction assumption is satisfied with the parameter specified in the full statement.
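For context, in the simplest version of this quadratic setting, where the stochastic oracle returns $\nabla F(x) = A(x - x^\star)$ corrupted only by additive noise with covariance $\Sigma$, the classical instance-dependent benchmark from Polyak-Juditsky-style averaging analyses, which is not necessarily the exact bound appearing in the proposition, takes the sandwich form

$$\mathbb{E}\,\bigl\|\widehat{x}_n - x^\star\bigr\|_2^2 \;\approx\; \frac{1}{n}\,\operatorname{tr}\bigl(A^{-1}\Sigma A^{-1}\bigr),$$

illustrating how the interplay between the Hessian $A$ and the noise covariance $\Sigma$ governs the attainable error at a given instance.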

Figures (4)

  • Figure 1: Heat maps of $\sqrt{n}(\widehat{x}_n^{(\texttt{RPJ})} - x^\star)$ for different $n$. We always initialize the algorithm at the origin (initial distance to minimizer is $\sqrt{2}$) and each heatmap is generated over $10{,}000$ trials.
  • Figure 2: Heat maps of $\sqrt{n}(\widehat{x}_n^{(\texttt{RPJ})} - x^\star)$ for different $\zeta^2$ and sample size $n = 200\zeta^2$. We always initialize the algorithm at the origin (initial distance to minimizer is $\sqrt{2}$) and each heatmap is generated over $10{,}000$ trials. (Note that $x_1$ and $x_2$ have different scales in the above plots.)
  • Figure 4: Comparison of averaging (constant stepsize) and our algorithm. All algorithms are initialized at the origin, and the total number of samples is $n = 200\zeta^2$. The error (y-axis) $n\|\widehat{x}_n - x^\star\|_2^2$ is averaged over 100 runs, where $\widehat{x}_n$ denotes the output of each algorithm under a given parameter setting.
  • Figure 5: Comparison of averaging (diminishing stepsize) and our algorithm. All algorithms are initialized at the origin, and the total number of samples is $n = 200\zeta^2$. The error (y-axis) $n\|\widehat{x}_n - x^\star\|_2^2$ is averaged over 100 runs, where $\widehat{x}_n$ denotes the output of each algorithm under a given parameter setting. (A sketch of this shared evaluation protocol follows the list.)
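The captions above share one evaluation protocol: initialize at the origin, draw $n = 200\zeta^2$ samples, compute the scaled error $n\|\widehat{x}_n - x^\star\|_2^2$, and average it over independent runs. A minimal sketch of that harness, under assumed dimensions and parameter values, is below; the estimator plugged in is a crude stand-in rather than the paper's VISOR or averaging procedures.

```python
import numpy as np

def scaled_error(estimator, n, x_star, runs=100, seed=0):
    """Evaluation protocol from the figure captions: average n * ||x_hat_n - x*||_2^2
    over independent runs, with every run initialized at the origin."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(runs):
        x0 = np.zeros_like(x_star)        # initialize at the origin
        x_hat = estimator(x0, n, rng)     # any estimator that consumes n samples
        errs.append(n * np.linalg.norm(x_hat - x_star) ** 2)
    return float(np.mean(errs))

if __name__ == "__main__":
    x_star = np.array([1.0, 1.0])

    def crude_estimator(x0, n, rng):
        # Placeholder: average n noisy observations of x*; purely illustrative,
        # ignores x0, and is not any algorithm from the paper.
        return x_star + rng.standard_normal((n, x_star.size)).mean(axis=0)

    zeta2 = 10.0
    n = int(200 * zeta2)                  # sample-size scaling used in the plots
    print(scaled_error(crude_estimator, n, x_star))
```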

Theorems & Definitions (58)

  • Proposition 3.1
  • Lemma 4.1
  • proof
  • Definition 4.2: Sub-exponential random vectors
  • Remark 1
  • Lemma 4.3
  • Proposition 4.4
  • Proposition 4.5
  • Proposition 5.1
  • proof
  • ...and 48 more