RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization
El Mahdi Chayti
TL;DR
RanSOM tackles curvature-induced bias in stochastic momentum for non-convex optimization by replacing fixed step sizes with randomized steps whose mean is $\eta_t$, and using Stein-type integration-by-parts to obtain an unbiased estimate of the curvature bias with a single Hessian-vector product. It instantiates two algorithms: RanSOM-E for unconstrained problems with exponential step distributions and RanSOM-B for constrained problems with beta distributions, both compatible with Linear Minimization Oracle updates. Theoretical results show optimal $\mathcal{O}(\epsilon^{-3})$ convergence under standard bounded-noise assumptions, and robustness to heavy-tailed gradient noise ($p,q\in(1,2]$) without requiring Lipschitz Hessians or auxiliary queries. Empirical results on Splice, MNIST1D, and Nano MovieLens demonstrate improved stability and performance relative to state-of-the-art baselines, validating RanSOM as a practical, assumption-light framework for non-convex and constrained optimization.
Abstract
Momentum methods, such as Polyak's Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in stochastic settings, limiting convergence to suboptimal $\mathcal{O}(\epsilon^{-4})$ rates. Existing corrections typically require expensive auxiliary sampling or restrictive smoothness assumptions. We propose \textbf{RanSOM}, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $\eta_t$. This modification allows us to leverage Stein-type identities to compute an exact, unbiased estimate of the momentum bias using a single Hessian-vector product computed jointly with the gradient, avoiding auxiliary queries. We instantiate this framework in two algorithms: \textbf{RanSOM-E} for unconstrained optimization (using exponentially distributed steps) and \textbf{RanSOM-B} for constrained optimization (using beta-distributed steps to strictly preserve feasibility). Theoretical analysis confirms that RanSOM recovers the optimal $\mathcal{O}(\epsilon^{-3})$ convergence rate under standard bounded noise, and achieves optimal rates for heavy-tailed noise settings ($p \in (1, 2]$) without requiring gradient clipping.
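To make the Stein-type mechanism concrete, here is a minimal numerical sketch of the integration-by-parts identity that an exponentially distributed step makes available. For $s \sim \mathrm{Exp}$ with mean $\eta$ and a smooth function with bounded derivatives, $\mathbb{E}[\nabla f(x + s d)] - \nabla f(x) = \eta\, \mathbb{E}[\nabla^2 f(x + s d)\, d]$, so the gradient drift along the step is matched in expectation by one Hessian-vector product at the new point. The toy objective $f(x) = \sum_i \sin(x_i)$, the direction `d`, and the helper names `grad`, `hvp`, and `stein_gap` are assumptions chosen for illustration; this is not the paper's estimator or algorithm, only a check of the underlying identity.

```python
import numpy as np

# Toy non-convex objective (an assumption for illustration): f(x) = sum_i sin(x_i)
def grad(x):
    # Gradient of f: cos applied coordinate-wise
    return np.cos(x)

def hvp(x, d):
    # Hessian-vector product: the Hessian is diag(-sin(x)), so H d = -sin(x) * d
    return -np.sin(x) * d

def stein_gap(x, d, eta, n_samples=400_000, seed=0):
    """Monte Carlo estimate of both sides of the exponential-step identity."""
    rng = np.random.default_rng(seed)
    # Random step sizes with mean eta (exponential distribution)
    s = rng.exponential(scale=eta, size=(n_samples, 1))
    x_new = x + s * d  # one randomized step per sample, shape (n_samples, dim)
    # Left side: average gradient drift caused by the step
    lhs = (grad(x_new) - grad(x)).mean(axis=0)
    # Right side: eta times the average Hessian-vector product at the new point
    rhs = eta * hvp(x_new, d).mean(axis=0)
    return lhs, rhs

x = np.array([0.3, -1.2, 2.0])
d = np.array([1.0, -0.5, 0.8])
lhs, rhs = stein_gap(x, d, eta=0.3)
print("gradient drift:      ", lhs)
print("eta * mean HVP:      ", rhs)
print("max abs discrepancy: ", np.max(np.abs(lhs - rhs)))
```

With a fixed step of size $\eta$ the same comparison would hold only up to a Taylor remainder that depends on higher-order smoothness; randomizing the step is what turns it into an exact identity in expectation, which is consistent with the abstract's claim that no Lipschitz-Hessian assumption is needed.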
