Stochastic optimization with arbitrary recurrent data sampling

William G. Powell; Hanbaek Lyu

TL;DR

This work addresses stochastic optimization of a non-convex finite-sum objective under arbitrary recurrent data sampling. By introducing regularized variants of MISO (RMISO, which uses proximal regularization, and RMISO_DR, which uses a diminishing radius), the authors prove that recurrence alone suffices for optimal first-order convergence, with rates scaling as $O(n^{-1/2})$ (and $O(n^{-1/2}\log n)$ for the diminishing-radius variant) and constants that depend on the speed of recurrence through $t_{\odot}$ and $t_{\text{hit}}$. The framework applies to distributed optimization and nonnegative matrix factorization, and convergence is shown, both theoretically and empirically, to accelerate when the sampling scheme covers the data efficiently. The results provide new insights into data-sampling design for stochastic first-order methods, offering practical guidance for decentralized systems and large-scale factorization tasks where communication or data access patterns deviate from i.i.d. assumptions.
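To make the idea of proximal regularization concrete, here is a minimal sketch of a MISO-style loop with an added proximal term. It is an illustration under assumptions not taken from the paper: each surrogate is a quadratic upper bound built from a stored gradient, the proximal weight `rho` is held constant (the paper's actual schedule for $\rho_n$ is not given in this summary), and the names `rmiso_quadratic`, `grads`, and `sampler` are hypothetical.

```python
import numpy as np

def rmiso_quadratic(grads, theta0, sampler, L, rho, n_iters):
    """Sketch of a proximally regularized MISO loop with quadratic surrogates.

    grads   : list of callables; grads[v](theta) is the gradient of f_v at theta
    sampler : callable n -> index v of the data point visited at step n
    L       : assumed smoothness constant used in every surrogate
    rho     : constant proximal weight (rho = 0 is plain MISO in this sketch)
    """
    m = len(grads)
    theta = np.asarray(theta0, dtype=float)
    anchors = np.tile(theta, (m, 1))                    # where each surrogate was last built
    g = np.array([grads[v](theta) for v in range(m)])   # gradient stored at each anchor
    for n in range(n_iters):
        v = sampler(n)
        anchors[v], g[v] = theta, grads[v](theta)       # refresh only the visited surrogate
        # Closed-form minimizer of: average surrogate + (rho / 2) * ||x - theta||^2
        theta = (L * anchors.mean(axis=0) - g.mean(axis=0) + rho * theta) / (L + rho)
    return theta

# Toy problem (an assumption for illustration): f_v(theta) = 0.5 * ||theta - b_v||^2
b = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
grads = [lambda th, bv=bv: th - bv for bv in b]
cyclic = lambda n: n % len(b)                           # one simple recurrent sampling scheme
theta_hat = rmiso_quadratic(grads, np.zeros(2), cyclic, L=1.0, rho=1.0, n_iters=60)
```

On this toy problem the minimizer is the mean of the `b_v`, and the sketch approaches it geometrically. Any sampler that keeps revisiting every index can be passed in place of `cyclic`.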

Abstract

To obtain an optimal first-order convergence guarantee for stochastic optimization, it is necessary to use a recurrent data sampling algorithm that samples every data point with sufficient frequency. Most commonly used data sampling algorithms (e.g., i.i.d., MCMC, random reshuffling) are indeed recurrent under mild assumptions. In this work, we show that for a particular class of stochastic optimization algorithms, we do not need any property of the data sampling algorithm other than recurrence (e.g., independence, exponential mixing, or reshuffling) to guarantee the optimal rate of first-order convergence. Namely, using regularized versions of Minimization by Incremental Surrogate Optimization (MISO), we show that for non-convex and possibly non-smooth objective functions, the expected optimality gap converges at an optimal rate $O(n^{-1/2})$ under general recurrent sampling schemes. Furthermore, the implied constant depends explicitly on the 'speed of recurrence', measured by the expected amount of time to visit a given data point, either averaged ('target time') or supremized ('hitting time') over the current location. We demonstrate theoretically and empirically that convergence can be accelerated by selecting sampling algorithms that cover the data set most effectively. We discuss applications of our general framework to decentralized optimization and distributed non-negative matrix factorization.
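The 'speed of recurrence' can be illustrated on small Markov-chain samplers. The sketch below computes expected hitting times by solving a linear system, then reports the worst case over start and target and the worst target after averaging the start over the uniform distribution. These two summaries follow the wording of the abstract and are an assumption; the paper's exact definitions of $t_{\text{hit}}$ and $t_{\odot}$ may differ. The three samplers and all function names are hypothetical examples.

```python
import numpy as np

def hitting_times(P):
    """H[i, j] = expected number of steps to first reach j from i (H[j, j] = 0)."""
    m = P.shape[0]
    H = np.zeros((m, m))
    for j in range(m):
        keep = [i for i in range(m) if i != j]
        Q = P[np.ix_(keep, keep)]                  # chain restricted to states other than j
        H[keep, j] = np.linalg.solve(np.eye(m - 1) - Q, np.ones(m - 1))
    return H

def recurrence_summary(P):
    """Return (worst-case hitting time, worst target with uniformly averaged start)."""
    H = hitting_times(P)
    pi = np.full(P.shape[0], 1.0 / P.shape[0])     # uniform: valid here since P is doubly stochastic
    return H.max(), (pi @ H).max()

m = 6
idx = np.arange(m)
iid = np.full((m, m), 1.0 / m)                     # independent uniform sampling
cyclic = np.zeros((m, m))
cyclic[idx, (idx + 1) % m] = 1.0                   # deterministic sweep through the data
walk = np.zeros((m, m))
walk[idx, (idx + 1) % m] = 0.5                     # simple random walk on a cycle graph
walk[idx, (idx - 1) % m] = 0.5

summaries = {name: recurrence_summary(P)
             for name, P in [("iid", iid), ("cyclic", cyclic), ("walk", walk)]}
```

In this toy setting the cyclic sweep has the smallest values under both summaries and the random walk on the cycle the largest, which is consistent with the claim that schemes covering the data set more effectively yield better constants.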

Paper Structure (36 sections, 25 theorems, 178 equations, 5 figures, 1 table, 4 algorithms)

Introduction
Contribution
Related Work
Notation
Preliminary Definitions and Algorithm Statement
Main Results
Optimality Conditions
Assumptions
Statement of main results
Sketch of proofs
Applications and Experiments
Applications
Distributed Matrix Factorization
Prox-Linear Surrogates
Experiments
...and 21 more sections

Key Result

Theorem 3.8

Algorithms RMISO and RMISO_DR satisfy the following for any $N \geq 1$:

Figures (5)

Figure 1: Lonely graph
Figure 2: Plot of reconstruction error against iteration number for NMF using two sampling algorithms. Results show the performance of algorithms RMISO, MISO (RMISO with $\rho_n = 0$), ONMF, and AdaGrad in factorizing a collection of MNIST data matrices.
Figure 3: Plot of objective loss and standard deviation on the test dataset for a9a for two graph topologies and various optimization algorithms: RMISO, MISO (RMISO with $\rho_n = 0$), AdaGrad, MCSAG, SGD, Adam, and SGD-HB.
Figure 4: Plot of reconstruction error against compute time for NMF using two sampling algorithms. Results show the performance of algorithms RMISO, MISO (RMISO with $\rho_n = 0$), ONMF, and AdaGrad in factorizing a collection of MNIST data matrices.
Figure 5: Plot of objective loss and standard deviation against compute time for a9a for two graph topologies and various optimization algorithms: RMISO, MISO (RMISO with $\rho_n = 0$), AdaGrad, MCSAG, SGD, Adam, and SGD-HB.

Theorems & Definitions (65)

Definition 2.1: First-order surrogates
Definition 2.2: Return time
Definition 2.3: Last passage time
Theorem 3.8: Rate of Convergence to Stationarity
Theorem 3.9: Global Convergence
Theorem 1.1: Extended Version of Theorem 3.8 in the main text
Corollary 1.2
Corollary 1.3: Iteration Complexity
Remark 1.4: Comparison with the lower bound of Even (2023), Theorem 1
Remark 1.5: Optimal sampling and estimating $t_{\odot}$ and $t_{\text{hit}}$
...and 55 more
