A stochastic first-order method with multi-extrapolated momentum for highly smooth unconstrained optimization
Chuan He
TL;DR
This work tackles unconstrained stochastic optimization when the $p$-th order derivative of the objective $f$ is Lipschitz continuous ($p\ge 2$). It introduces a stochastic first-order method with multi-extrapolated momentum that performs $p-1$ extrapolations per iteration and uses a time-varying, momentum-based gradient estimator designed to exploit higher-order smoothness. Under standard assumptions plus $D^p f$ being Lipschitz, the author proves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-(3p+1)/p})$ to obtain $\mathbb{E}[\|\nabla f(x)\|]\le \epsilon$, improving upon the best-known bounds that do not assume mean-squared smoothness. Numerical experiments on data fitting and robust regression problems corroborate the theoretical gains and demonstrate practical advantages of the multi-extrapolated momentum approach.
Abstract
In this paper, we consider an unconstrained stochastic optimization problem where the objective function exhibits high-order smoothness. Specifically, we propose a new stochastic first-order method (SFOM) with multi-extrapolated momentum, in which multiple extrapolations are performed in each iteration, followed by a momentum update based on these extrapolations. We demonstrate that the proposed SFOM can accelerate optimization by exploiting the high-order smoothness of the objective function $f$. Assuming that the $p$th-order derivative of $f$ is Lipschitz continuous for some $p\ge 2$, and under additional mild assumptions, we establish that our method achieves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-(3p+1)/p})$ for finding a point $x$ such that $\mathbb{E}[\|\nabla f(x)\|]\le \epsilon$. To the best of our knowledge, this is the first SFOM to leverage arbitrary-order smoothness of the objective function for acceleration, resulting in a sample complexity that improves upon the best-known results without assuming the mean-squared smoothness condition. Preliminary numerical experiments validate the practical performance of our method and support our theoretical findings.
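As a reading aid, the exponent in the stated bound can be rewritten to show how it depends on $p$. The values below are plain arithmetic on the stated rate, ignoring the logarithmic factors hidden in $\widetilde{\mathcal{O}}$; they are not additional results from the paper.

$$\frac{3p+1}{p} = 3 + \frac{1}{p}, \qquad p=2:\ \epsilon^{-3.5}, \quad p=3:\ \epsilon^{-10/3}, \quad p=4:\ \epsilon^{-3.25}, \qquad \lim_{p\to\infty}\frac{3p+1}{p} = 3.$$

The exponent therefore decreases monotonically as $p$ grows and approaches, but never reaches, $3$.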
