On the SAGA algorithm with decreasing step

Luis Fredes; Bernard Bercu; Eméric Gbaguidi

TL;DR

This paper analyzes a generalized $\lambda$-SAGA algorithm for minimizing $f(x)=\dfrac{1}{N}\sum_{k=1}^N f_k(x)$ with a decreasing step size $\{\gamma_n\}$, unifying SGD and SAGA via $\lambda\in[0,1]$. It shows almost sure convergence to the optimum and a central limit theorem under weakened assumptions that avoid strong convexity and Lipschitz gradient, and it proves non-asymptotic $\mathbb{L}^p$ convergence rates that depend on the rate parameter $\alpha$ of $\gamma_n$. Numerical experiments on MNIST-based logistic regression illustrate the theory, including variance reduction as $\lambda$ increases. Overall, the work provides a unified, practically applicable convergence framework for stochastic variance-reduction methods with decreasing steps.

Abstract

Stochastic optimization naturally appears in many application areas, including machine learning. Our goal is to go further in the analysis of the Stochastic Average Gradient Accelerated (SAGA) algorithm. To achieve this, we introduce a new $\lambda$-SAGA algorithm which interpolates between Stochastic Gradient Descent ($\lambda=0$) and the SAGA algorithm ($\lambda=1$). First, we investigate the almost sure convergence of this new algorithm with decreasing step, which allows us to avoid the restrictive strong convexity and Lipschitz gradient hypotheses associated with the objective function. Second, we establish a central limit theorem for the $\lambda$-SAGA algorithm. Finally, we provide non-asymptotic $\mathbb{L}^p$ rates of convergence.
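To make the interpolation concrete, here is a minimal Python sketch of one way such an iteration could look. The update rule in the docstring is an assumption chosen only to match the two endpoints stated above (plain SGD at $\lambda=0$, SAGA at $\lambda=1$); it is not quoted from the paper. The names `lambda_saga`, `make_toy_problem` and `grad_k`, the step constants `c` and `alpha`, and the noise-free least-squares problem are all hypothetical and chosen for illustration.

```python
import numpy as np


def lambda_saga(grad_k, x0, N, lam, n_steps, c=0.5, alpha=0.6, seed=0):
    """Sketch of a lambda-SAGA loop with step gamma_n = c / (n + 1)**alpha.

    Assumed update (not quoted from the paper):
        X_{n+1} = X_n - gamma_n * (grad f_U(X_n) - lam * (g_{n,U} - mean_k g_{n,k}))
    where g_{n,k} is the last stored gradient of f_k and U is uniform on {0, ..., N-1}.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    # Gradient table, initialised with every component gradient at x0.
    table = np.stack([grad_k(k, x) for k in range(N)])
    table_mean = table.mean(axis=0)
    for n in range(n_steps):
        gamma = c / (n + 1) ** alpha          # decreasing step
        k = int(rng.integers(N))              # uniformly sampled index
        g_new = grad_k(k, x)                  # fresh gradient at the current iterate
        # lam = 0 drops the correction (SGD); lam = 1 applies it fully (SAGA).
        x = x - gamma * (g_new - lam * (table[k] - table_mean))
        # Refresh the stored gradient and keep the running mean in sync.
        table_mean = table_mean + (g_new - table[k]) / N
        table[k] = g_new
    return x


def make_toy_problem(N=100, d=5, seed=1):
    """Hypothetical noise-free least squares: f_k(x) = 0.5 * (a_k . x - b_k)**2."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((N, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm rows
    x_star = rng.standard_normal(d)
    b = A @ x_star                                  # x_star minimises the average

    def grad_k(k, x):
        return (A[k] @ x - b[k]) * A[k]

    return grad_k, x_star


if __name__ == "__main__":
    grad_k, x_star = make_toy_problem()
    for lam in (0.0, 0.5, 1.0):
        x = lambda_saga(grad_k, np.zeros(5), N=100, lam=lam, n_steps=20000)
        print(f"lambda={lam}: distance to optimum = {np.linalg.norm(x - x_star):.2e}")
```

The running mean of the table is updated incrementally, so each iteration costs one component gradient rather than $N$. The toy problem is too simple to show the variance-reduction effect reported in the paper; it only checks that the loop drives the iterate toward the minimizer for several values of $\lambda$.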

Paper Structure (15 sections, 11 theorems, 160 equations, 3 figures)

This paper contains 15 sections, 11 theorems, 160 equations, 3 figures.

Introduction
Related work
The $\lambda$-SAGA algorithm
Main results
Almost sure convergence
Asymptotic normality
Non-asymptotic convergence rates
Numerical experiments
Conclusion
Some useful existing results
Proof of Theorem \ref{sagag_th_tlc1}
Proof of Theorem \ref{sagag_th_mse1}
Proof of Theorem \ref{sagag_th_mse2}
Additional asymptotic result on the convergence in $\mathbb{L}^2$
Additional asymptotic result on the convergence in $\mathbb{L}^{p}$

Key Result

Theorem 1

Consider a fixed $\lambda \in [0,1]$. Assume that $(X_n)$ is the sequence generated by the $\lambda$-SAGA algorithm with decreasing step sequence $(\gamma_n)$ satisfying (gamma_cond1). In addition, suppose that Assumptions saga2_cond1, saga2_cond2 and saga2_cond3 are satisfied. Then, we have and

Figures (3)

Figure 1: Convergence with $\gamma_n=1/n$ over 1.2M iterations. The label "Gradient evaluations" is used because, instead of $\|\nabla f(X_n)\|$, we plot the norm of the mean of the rows of the matrix $g_n$, that is $\|\sum_{k=1}^N g_{n,k}\|/N$. This quantity tracks convergence since it also converges to 0, and the rows of $g_n$ converge to the gradients of the functions $f_k$ at the optimum: for each $1 \leqslant k \leqslant N$, $\lim g_{n,k}=\nabla f_k(x^*)$ as $n$ goes to infinity.
Figure 2: We used 1000 samples, where each one was obtained by running the associated algorithm for $n=500000$ iterations.
Figure 3: Mean squared error as a function of epochs. The mean squared error of $X_n-x^*$ decreases as $\lambda$ and $n$ increase.
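The two quantities reported in these captions can be computed along the following lines. This is a sketch only: the function names `table_mean_norm` and `empirical_mse` and the quadratic toy case are hypothetical, and the paper's own experimental code is not shown in the source.

```python
import numpy as np


def table_mean_norm(table):
    """Norm of the average stored gradient, ||sum_k g_{n,k}|| / N (one row per f_k)."""
    table = np.asarray(table, dtype=float)
    return float(np.linalg.norm(table.mean(axis=0)))


def empirical_mse(samples, x_star):
    """Monte Carlo estimate of E||X_n - x*||^2 from independent runs (one row per run)."""
    diffs = np.asarray(samples, dtype=float) - np.asarray(x_star, dtype=float)
    return float(np.mean(np.sum(diffs ** 2, axis=1)))


if __name__ == "__main__":
    # Hypothetical toy case: f_k(x) = 0.5 * ||x - c_k||^2, so grad f_k(x) = x - c_k
    # and the minimiser of the average is the mean of the centers.
    centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
    x_star = centers.mean(axis=0)
    table_at_optimum = x_star - centers   # rows are grad f_k(x*)
    print("table mean norm at the optimum:", table_mean_norm(table_at_optimum))

    # Four made-up end-of-run iterates, used only to exercise empirical_mse.
    runs = x_star + np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])
    print("empirical MSE:", empirical_mse(runs, x_star))
```

In the toy case the individual rows $\nabla f_k(x^*)$ are nonzero while their average vanishes, which is why the mean of the table, rather than any single row, is the natural quantity to monitor.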

Theorems & Definitions (14)

Theorem 1
Theorem 2
Remark 1
Theorem 3
Theorem 4
Remark 5
Theorem 1.1: Robbins-Siegmund theorem
Lemma 1.1
Lemma 1.2
Remark 1.1
...and 4 more
