Stochastic optimization with arbitrary recurrent data sampling

William G. Powell, Hanbaek Lyu

TL;DR

This work addresses stochastic optimization of a non-convex finite-sum objective under arbitrary recurrent data sampling. By introducing Regularized MISO with proximal regularization, in two variants (RMISO and RMISO_DR), the authors prove that recurrence alone suffices for optimal first-order convergence, with rates scaling as $O(n^{-1/2})$ (and $O(n^{-1/2}\log n)$ for the diminishing-radius variant) and constants that depend on the speed of recurrence through $t_{\odot}$ and $t_{\text{hit}}$. The framework supports distributed optimization and nonnegative matrix factorization, and convergence is shown, both theoretically and empirically, to accelerate when the sampling scheme covers the data efficiently. The results provide new insights into data-sampling design for stochastic first-order methods, offering practical guidance for decentralized systems and large-scale factorization tasks where communication or data access patterns deviate from i.i.d. assumptions.
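To make the idea of a proximally regularized MISO step concrete, here is a minimal sketch under several assumptions that do not come from the summary above: the components are least-squares terms $f_i(\theta) = \tfrac12 (a_i^\top \theta - b_i)^2$, each surrogate is the standard quadratic upper bound with a common constant $L$, and the regularization weight `rho` is held constant. The function name `rmiso_least_squares` and the `sampler` argument are hypothetical, and the paper's actual schedule for $\rho_n$ may differ.

```python
import numpy as np

def rmiso_least_squares(A, b, sampler, rho=0.5, theta0=None):
    """Sketch of MISO with a proximal term (rho/2)*||theta - theta_prev||^2.

    Assumption: f_i(theta) = 0.5 * (A[i] @ theta - b[i])**2, with quadratic
    surrogates g_i(theta) = f_i(k_i) + grad_i(k_i) @ (theta - k_i)
                            + (L/2) * ||theta - k_i||^2.
    `sampler` is any iterable of indices (i.i.d., cyclic, a Markov chain, ...).
    """
    m, d = A.shape
    L = float(np.max(np.sum(A ** 2, axis=1)))  # upper bound on every f_i's smoothness
    theta = np.zeros(d) if theta0 is None else np.asarray(theta0, dtype=float).copy()

    def grad(i, th):
        return A[i] * (A[i] @ th - b[i])

    # z[i] is the minimizer of the i-th quadratic surrogate: k_i - grad_i(k_i) / L.
    z = np.array([theta - grad(i, theta) / L for i in range(m)])
    z_bar = z.mean(axis=0)  # minimizer of the averaged surrogate

    for i in sampler:
        # Refresh only the sampled surrogate at the current iterate.
        z_new = theta - grad(i, theta) / L
        z_bar += (z_new - z[i]) / m
        z[i] = z_new
        # Closed-form minimizer of: averaged surrogate + (rho/2)*||theta - theta_prev||^2.
        theta = (L * z_bar + rho * theta) / (L + rho)
    return theta

# Usage: cyclic sampling, one simple recurrent scheme.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
theta_star = np.array([1.0, -2.0])
b = A @ theta_star
cyclic = [n % 4 for n in range(400)]
theta_hat = rmiso_least_squares(A, b, cyclic, rho=0.5)
```

Setting `rho=0` in this sketch gives the plain MISO update, which is how the figure captions on this page describe the MISO baseline. The toy problem is convex and only illustrates the mechanics of the update; it says nothing about the non-convex guarantees.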

Abstract

To obtain an optimal first-order convergence guarantee for stochastic optimization, it is necessary to use a recurrent data sampling algorithm that samples every data point with sufficient frequency. Most commonly used data sampling algorithms (e.g., i.i.d., MCMC, random reshuffling) are indeed recurrent under mild assumptions. In this work, we show that for a particular class of stochastic optimization algorithms, we do not need any property of the data sampling algorithm other than recurrence (e.g., independence, exponential mixing, or reshuffling) to guarantee the optimal rate of first-order convergence. Namely, using regularized versions of Minimization by Incremental Surrogate Optimization (MISO), we show that for non-convex and possibly non-smooth objective functions, the expected optimality gap converges at an optimal rate $O(n^{-1/2})$ under general recurrent sampling schemes. Furthermore, the implied constant depends explicitly on the 'speed of recurrence', measured by the expected amount of time to visit a given data point, either averaged ('target time') or supremized ('hitting time') over the current location. We demonstrate theoretically and empirically that convergence can be accelerated by selecting sampling algorithms that cover the data set most effectively. We discuss applications of our general framework to decentralized optimization and distributed non-negative matrix factorization.
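The 'speed of recurrence' can be illustrated numerically when the sampler is a finite Markov chain. The sketch below computes expected hitting times by solving the standard linear system and then reports two summaries: the worst case over both the start and the target, and the worst target when the start is averaged under the stationary distribution. These two summaries are one plausible reading of the hitting time and target time described above, not the paper's exact definitions, and the function names are hypothetical.

```python
import numpy as np

def hitting_times(P):
    """H[x, v] = expected number of steps for chain P, started at x, to first reach v."""
    m = P.shape[0]
    H = np.zeros((m, m))
    for v in range(m):
        others = [x for x in range(m) if x != v]
        Q = P[np.ix_(others, others)]  # chain restricted to states other than v
        # h = 1 + Q h  <=>  (I - Q) h = 1
        H[others, v] = np.linalg.solve(np.eye(m - 1) - Q, np.ones(m - 1))
    return H

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 (assumes an irreducible chain)."""
    m = P.shape[0]
    lhs = np.vstack([P.T - np.eye(m), np.ones((1, m))])
    rhs = np.append(np.zeros(m), 1.0)
    pi, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return pi

def recurrence_summary(P):
    H = hitting_times(P)
    pi = stationary_distribution(P)
    worst_case = H.max()        # supremum over start and target
    averaged = (pi @ H).max()   # start averaged under pi, worst target
    return worst_case, averaged

m = 6
eye = np.eye(m)
iid = np.full((m, m), 1.0 / m)                                       # uniform i.i.d. sampling
walk = 0.5 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))    # random walk on a cycle
cyclic = np.roll(eye, 1, axis=1)                                     # deterministic sweep

for name, P in [("iid", iid), ("walk", walk), ("cyclic", cyclic)]:
    print(name, recurrence_summary(P))
```

On six data points this gives a worst-case value of 6 for i.i.d. sampling, 9 for the random walk on a cycle, and 5 for the deterministic sweep, so the three schemes visit every point at noticeably different speeds even though all of them are recurrent.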

Paper Structure (36 sections, 25 theorems, 178 equations, 5 figures, 1 table, 4 algorithms)

Key Result

Theorem 3.8

Algorithms RMISO and RMISO_DR satisfy the following for any $N \geq 1$:

Figures (5)

  • Figure 1: Lonely graph
  • Figure 2: Plot of reconstruction error against iteration number for NMF using two sampling algorithms. Results show the performance of RMISO, MISO (Algorithm RMISO with $\rho_n = 0$), ONMF, and AdaGrad in factorizing a collection of MNIST data matrices.
  • Figure 3: Plot of objective loss and standard deviation on the test dataset for a9a for two graph topologies and various optimization algorithms: RMISO, MISO (Algorithm RMISO with $\rho_n = 0$), AdaGrad, MCSAG, SGD, Adam, and SGD-HB.
  • Figure 4: Plot of reconstruction error against compute time for NMF using two sampling algorithms. Results show the performance of RMISO, MISO (Algorithm RMISO with $\rho_n = 0$), ONMF, and AdaGrad in factorizing a collection of MNIST data matrices.
  • Figure 5: Plot of objective loss and standard deviation against compute time for a9a for two graph topologies and various optimization algorithms: RMISO, MISO (Algorithm RMISO with $\rho_n = 0$), AdaGrad, MCSAG, SGD, Adam, and SGD-HB.

Theorems & Definitions (65)

  • Definition 2.1: First-order surrogates
  • Definition 2.2: Return time
  • Definition 2.3: Last passage time
  • Theorem 3.8: Rate of Convergence to Stationarity
  • Theorem 3.9: Global Convergence
  • Theorem 1.1: Extended Version of Theorem 3.8 (Rate of Convergence to Stationarity) in the main text
  • Corollary 1.2
  • Corollary 1.3: Iteration Complexity
  • Remark 1.4: Comparison with the lower bound of [even2023stochastic, Theorem 1]
  • Remark 1.5: Optimal sampling and estimating $t_{\odot}$ and $t_{\text{hit}}$
  • ...and 55 more