Variance reduction techniques for stochastic proximal point algorithms

Cheik Traoré, Vassilis Apidopoulos, Saverio Salzo, Silvia Villa

TL;DR

This work addresses finite-sum optimization $F(\bm{x}) = \frac{1}{n}\sum_{i=1}^n f_i(\bm{x})$ by introducing a unified variance-reduced stochastic proximal-point framework that encompasses proximal SVRG, SAGA, and related variants. It develops a generic algorithm with a variance-correcting term $\bm{e}^k$ and proves $O(1/k)$ convergence for convex $F$, plus linear convergence under the Polyak-Łojasiewicz condition with constant steps. The paper derives concrete schemes (SVRP, SAPA, L-SVRP) and shows through experiments that proximal variance-reduction methods offer greater stability to step-size choices and competitive or superior performance compared to gradient-based variants on challenging problems. These results highlight the practical robustness and efficiency of proximal variance-reduction approaches for large-scale finite-sum optimization.
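For reference, the two standard notions named above can be written out explicitly. This is a sketch in generic notation: the step size $\alpha > 0$, the PL constant $\mu > 0$, and $F_* = \inf F$ are symbols assumed here, and the paper's own normalization of the PL constant may differ.

```latex
% Proximal operator of a convex function f with step size \alpha > 0
\operatorname{prox}_{\alpha f}(\bm{v})
  = \operatorname*{argmin}_{\bm{x}}
    \Big\{ f(\bm{x}) + \frac{1}{2\alpha}\,\|\bm{x}-\bm{v}\|^2 \Big\}

% Polyak-Lojasiewicz (PL) condition with constant \mu > 0
F(\bm{x}) - F_* \le \frac{1}{2\mu}\,\|\nabla F(\bm{x})\|^2
  \quad \text{for all } \bm{x}
```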

Abstract

In the context of finite sums minimization, variance reduction techniques are widely used to improve the performance of state-of-the-art stochastic gradient methods. Their practical impact is clear, as well as their theoretical properties. Stochastic proximal point algorithms have been studied as an alternative to stochastic gradient algorithms since they are more stable with respect to the choice of the step size. However, their variance-reduced versions are not as well studied as the gradient ones. In this work, we propose the first unified study of variance reduction techniques for stochastic proximal point algorithms. We introduce a generic stochastic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants. For this algorithm, in the smooth setting, we provide several convergence rates for the iterates and the objective function values, which are faster than those of the vanilla stochastic proximal point algorithm. More specifically, for convex functions, we prove a sublinear convergence rate of $O(1/k)$. In addition, under the Polyak-Łojasiewicz (PL) condition, we obtain linear convergence rates. Finally, our numerical experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts in terms of the stability with respect to the choice of the step size in most cases, especially for difficult problems.
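The following Python sketch illustrates what a variance-reduced proximal step of SVRG type could look like on a least-squares problem with $f_i(\bm{x}) = \tfrac{1}{2}(\langle \bm{a}_i, \bm{x}\rangle - b_i)^2$. It assumes, and this is not stated in the text above, that the update has the form $\bm{x}^{k+1} = \operatorname{prox}_{\alpha f_{i}}(\bm{x}^k + \alpha \bm{e}^k)$ with $\bm{e}^k = \nabla f_{i}(\tilde{\bm{x}}) - \nabla F(\tilde{\bm{x}})$ for a snapshot point $\tilde{\bm{x}}$. The function names, the last-iterate snapshot rule, and the parameter values are hypothetical choices made for illustration only.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(A, b, x):
    # F(x) = (1/n) * sum_i 0.5 * (<a_i, x> - b_i)^2
    return sum(0.5 * (dot(a, x) - bi) ** 2 for a, bi in zip(A, b)) / len(A)


def full_gradient(A, b, x):
    # grad F(x) = (1/n) * sum_i (<a_i, x> - b_i) * a_i
    n, d = len(A), len(x)
    g = [0.0] * d
    for a, bi in zip(A, b):
        r = dot(a, x) - bi
        for j in range(d):
            g[j] += r * a[j] / n
    return g


def prox_least_squares(a, bi, v, alpha):
    # Closed-form prox of alpha * f_i at v for f_i(x) = 0.5 * (<a, x> - bi)^2
    r = (dot(a, v) - bi) / (1.0 + alpha * dot(a, a))
    return [vj - alpha * r * aj for vj, aj in zip(v, a)]


def svrp(A, b, x0, alpha, epochs, inner_steps, seed=0):
    # Assumed SVRG-style proximal scheme (illustrative, not the paper's code)
    rng = random.Random(seed)
    n = len(A)
    x_tilde = list(x0)
    for _ in range(epochs):
        g_tilde = full_gradient(A, b, x_tilde)  # full gradient at the snapshot
        x = list(x_tilde)
        for _ in range(inner_steps):
            i = rng.randrange(n)
            a, bi = A[i], b[i]
            r_tilde = dot(a, x_tilde) - bi
            # Assumed correction: e = grad f_i(x_tilde) - grad F(x_tilde)
            e = [r_tilde * aj - gj for aj, gj in zip(a, g_tilde)]
            v = [xj + alpha * ej for xj, ej in zip(x, e)]
            x = prox_least_squares(a, bi, v, alpha)
        x_tilde = x  # assumption: snapshot is the last inner iterate
    return x_tilde


if __name__ == "__main__":
    rng = random.Random(1)
    n, d = 40, 5
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    b = [dot(a, [1.0] * d) for a in A]  # consistent system, so F_* = 0
    x = svrp(A, b, [0.0] * d, alpha=0.02, epochs=30, inner_steps=2 * n)
    print("final objective:", objective(A, b, x))
```

Each inner step costs one proximal evaluation, which is available in closed form here; for a general $f_i$ the proximal subproblem would have to be solved numerically.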

Paper Structure (27 sections, 14 theorems, 78 equations, 5 figures, 5 algorithms)


Key Result

Proposition 3.1

Suppose that Assumptions ass:variance and ass:a11 hold and that the sequence $(\bm{x}^k)_{k \in \mathbb{N}}$ is generated by Algorithm algo:unified. Let $M > 0$. Then, for all $k \in \mathbb{N}$,

Figures (5)

  • Figure 1: Evolution of $F(\tilde{\bm{x}}^s)-F_*$, with respect to the normalized iterations counter, for different values of the number of functions $n$. We compare the performance of SAPA (blue) and SVRP (green) to SPPA (orange) in the logistic regression case.
  • Figure 2: Evolution of $F(\tilde{ \bm{x}}^s)-F_*$, with respect to the normalized iterations counter, for different values of number of functions $n$. We compare the performance of SAPA (blue) and SVRP (green) to SPPA (orange) in the ordinary least squares case.
  • Figure 3: Number of iterations needed in order to achieve an accuracy of at least $0.01$ for different step sizes when solving an OLS problem. A cap is put at $40000$ iterations. Here we compare SAPA (in blue) with SAGA (in orange) for different values of the number of functions $n$.
  • Figure 4: Number of iterations needed in order to achieve an accuracy of at least $0.01$ for different step sizes when solving a logistic problem. A cap is put at $2.5\times 10^{7}$ iterations. Here we compare SAPA (in blue) with SAGA (in orange) for different values of the number of functions $n$.
  • Figure 5: Number of iterations needed in order to achieve an accuracy of at least $0.01$ for different step sizes when solving an OLS problem. Here we compare SVRP (in blue) with SVRG (in orange) for four different values of the dimension $d$ of the OLS problem, ranging from $d=1000$ (easy case) to $d=3000$ (hard case).
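
The step-size sensitivity that Figures 3 to 5 measure can be illustrated on a toy problem. The sketch below is not the paper's experiment: it is a deterministic, one-dimensional comparison on $f(x) = \tfrac{L}{2}x^2$, with all names and parameter values chosen here for illustration. It contrasts an explicit gradient step, which diverges once the step size exceeds $2/L$, with a proximal (implicit) step, which contracts for every positive step size.

```python
def gradient_step(x, alpha, L):
    # Explicit step on f(x) = 0.5 * L * x^2: x - alpha * f'(x)
    return x - alpha * L * x


def proximal_step(x, alpha, L):
    # Implicit step: argmin_y 0.5 * L * y^2 + (y - x)^2 / (2 * alpha)
    return x / (1.0 + alpha * L)


def run(step, alpha, L=10.0, x0=1.0, iters=50):
    # Distance to the minimizer x = 0 after `iters` steps
    x = x0
    for _ in range(iters):
        x = step(x, alpha, L)
    return abs(x)


if __name__ == "__main__":
    # With L = 10 the gradient step is stable only for alpha < 0.2
    for alpha in (0.05, 0.19, 0.25, 1.0):
        print(alpha, run(gradient_step, alpha), run(proximal_step, alpha))
```

The gradient iterate is multiplied by $1 - \alpha L$ at each step and the proximal iterate by $1/(1 + \alpha L)$, which is why only the former can leave the unit interval.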

Theorems & Definitions (27)

  • Remark 2.2
  • Remark 2.3
  • Proposition 3.1
  • Theorem 3.2
  • proof
  • Theorem 3.3
  • proof
  • Remark 3.4
  • Lemma 4.2
  • Theorem 4.3
  • ...and 17 more