A multilevel stochastic regularized first-order method with application to finite sum minimization
Filippo Marini, Margherita Porcelli, Elisa Riccietti
TL;DR
This work addresses large-scale stochastic optimization by introducing MU$^{\ell}$STREG, a multilevel stochastic adaptive-regularization gradient method that builds a hierarchy of computable approximations in either the variable space or the function space. By alternating fine stochastic steps with cheaper coarse steps, the method reduces the cost per iteration while maintaining convergence guarantees: the authors prove almost-sure convergence to a first-order stationary point under probabilistic accuracy assumptions on the models and estimates. The framework extends deterministic multilevel ideas and STORM to a stochastic setting in which the finest level of the hierarchy need not coincide with the original objective throughout the optimization, and it specializes to finite-sum minimization with hierarchical subsampling. Numerical experiments on binary classification show that MU$^{\ell}$STREG outperforms one-level variants and is competitive with SVRG and Adagrad, which suggests it is well suited to large-scale learning tasks where full passes over the data are expensive.
Abstract
In this paper, we propose a multilevel stochastic framework for solving nonconvex unconstrained optimization problems. The proposed approach uses random regularized first-order models that exploit an available hierarchical description of the problem, either in the classical variable space or in the function space, meaning that approximations of the objective function at different levels of accuracy are available. We provide a convergence analysis showing almost sure global convergence of the method to a first-order stationary point. The numerical behavior is tested on finite-sum minimization problems. Unlike classical deterministic multilevel schemes, our stochastic method does not require the finest approximation to coincide with the original objective function throughout the optimization process. This makes it possible to significantly reduce the cost of multilevel schemes, for instance in data-fitting problems, where using all the data at every iteration can be avoided.
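To make the function-space hierarchy concrete, the following is a minimal two-level sketch for a finite-sum logistic loss. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the nested subsample sizes (`n // 8` and `n // 2`), the strict fine/coarse alternation, the first-order correction of the coarse model (borrowed from classical multilevel schemes), and the simplified acceptance test are all hypothetical choices made for this example. Note that even the "fine" level uses only half of the data, so the full objective is never evaluated inside the loop.

```python
import math
import random

def nested_subsamples(n, sizes, rng):
    """Nested index sets S_1 within S_2 within ...: prefixes of one random permutation."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [perm[:m] for m in sorted(sizes)]

def loss_and_grad(w, X, y, idx):
    """Average logistic loss and its gradient over rows idx (labels in {-1, +1})."""
    d = len(w)
    loss, grad = 0.0, [0.0] * d
    for i in idx:
        z = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        # Numerically stable log(1 + exp(-z)) and 1 / (1 + exp(z)).
        loss += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
        sig = math.exp(-max(z, 0.0)) / (1.0 + math.exp(-abs(z)))
        for j in range(d):
            grad[j] -= sig * y[i] * X[i][j]
    m = len(idx)
    return loss / m, [g / m for g in grad]

def two_level_sketch(X, y, n_iter=60, lam=1.0, eta=0.1, gamma=2.0,
                     coarse_steps=3, seed=0):
    """Alternate fine and coarse regularized gradient steps (illustrative only)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for k in range(n_iter):
        # Assumed hierarchy: a small coarse subsample nested in a larger fine one.
        coarse, fine = nested_subsamples(n, [n // 8, n // 2], rng)
        f_fine, g_fine = loss_and_grad(w, X, y, fine)
        if k % 2 == 0:
            # Fine step: minimizer of g^T s + (lam / 2) ||s||^2.
            s = [-g / lam for g in g_fine]
        else:
            # Coarse step: shift the coarse gradient so that it matches the
            # fine gradient at w, then take a few cheap regularized steps.
            _, g_c0 = loss_and_grad(w, X, y, coarse)
            corr = [gf - gc for gf, gc in zip(g_fine, g_c0)]
            v = list(w)
            for _ in range(coarse_steps):
                _, g_c = loss_and_grad(v, X, y, coarse)
                v = [vj - (gc + c) / lam for vj, gc, c in zip(v, g_c, corr)]
            s = [vj - wj for vj, wj in zip(v, w)]
        trial = [wj + sj for wj, sj in zip(w, s)]
        f_trial, _ = loss_and_grad(trial, X, y, fine)
        # Simplified acceptance test on the fine-level estimates.
        if f_fine - f_trial >= eta * 0.5 * lam * sum(sj * sj for sj in s):
            w, lam = trial, max(lam / gamma, 1e-3)  # success: relax regularization
        else:
            lam *= gamma  # failure: regularize more strongly
    return w
```

In this sketch the regularization parameter `lam` plays the role of an inverse step size: it is decreased after a successful step and increased after a rejected one. Each coarse iteration evaluates gradients on the small subsample only, apart from the single fine-level gradient needed for the correction, which is where the cost saving would come from.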
