RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization

El Mahdi Chayti

TL;DR

RanSOM tackles curvature-induced bias in stochastic momentum for non-convex optimization by replacing fixed step sizes with randomized steps whose mean is $\eta_t$, and using Stein-type integration-by-parts to obtain an unbiased estimate of the curvature bias with a single Hessian-vector product. It instantiates two algorithms: RanSOM-E for unconstrained problems with exponential step distributions and RanSOM-B for constrained problems with beta distributions, both compatible with Linear Minimization Oracle updates. Theoretical results show optimal $\mathcal{O}(\epsilon^{-3})$ convergence under standard bounded-noise assumptions, and robustness to heavy-tailed gradient noise ($p,q\in(1,2]$) without requiring Lipschitz Hessians or auxiliary queries. Empirical results on Splice, MNIST1D, and Nano MovieLens demonstrate improved stability and performance relative to state-of-the-art baselines, validating RanSOM as a practical, assumption-light framework for non-convex and constrained optimization.
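The feasibility claim for beta-distributed steps can be illustrated with a small sketch. It assumes a Frank-Wolfe-style update $x \leftarrow x + s\,(v - x)$, where $v$ comes from a Linear Minimization Oracle and $s \in [0, 1]$ is random. The probability simplex, the helper names `lmo_simplex`, `beta_step`, and `run`, the Gaussian stand-in for a stochastic gradient, and the parameterization $s \sim \mathrm{Beta}(1, 1/\eta - 1)$ (chosen only so that $\mathbb{E}[s] = \eta$) are all assumptions made for illustration, not details taken from the paper.

```python
import random

def lmo_simplex(g):
    """Linear minimization oracle over the probability simplex:
    argmin over the simplex of <g, v> is the vertex at the smallest entry of g."""
    i = min(range(len(g)), key=lambda k: g[k])
    return [1.0 if k == i else 0.0 for k in range(len(g))]

def beta_step(x, g, eta, rng):
    """One Frank-Wolfe-style update with a random step s in [0, 1].
    Assumption: s ~ Beta(1, 1/eta - 1), which has mean eta."""
    s = rng.betavariate(1.0, 1.0 / eta - 1.0)
    v = lmo_simplex(g)
    # Convex combination of two feasible points, so the result stays feasible.
    return [xi + s * (vi - xi) for xi, vi in zip(x, v)], s

def run(n_steps=1000, eta=0.1, seed=0):
    """Iterate the update and record whether every iterate stayed on the simplex."""
    rng = random.Random(seed)
    x = [0.25, 0.25, 0.25, 0.25]
    steps, feasible = [], True
    for _ in range(n_steps):
        g = [rng.gauss(0.0, 1.0) for _ in x]  # stand-in for a stochastic gradient
        x, s = beta_step(x, g, eta, rng)
        steps.append(s)
        feasible = feasible and min(x) >= 0.0 and abs(sum(x) - 1.0) < 1e-9
    return x, steps, feasible

x, steps, feasible = run()
print("all iterates feasible:", feasible)
print("average step size:", sum(steps) / len(steps))
```

Because the step is supported on $[0, 1]$, every iterate is a convex combination of feasible points, so no projection is needed. A step distribution with unbounded support, such as the exponential, would not give this guarantee on a constrained set.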

Abstract

Momentum methods, such as Polyak's Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in stochastic settings, limiting convergence to suboptimal $\mathcal{O}(\epsilon^{-4})$ rates. Existing corrections typically require expensive auxiliary sampling or restrictive smoothness assumptions. We propose RanSOM, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $\eta_t$. This modification allows us to leverage Stein-type identities to compute an exact, unbiased estimate of the momentum bias using a single Hessian-vector product computed jointly with the gradient, avoiding auxiliary queries. We instantiate this framework in two algorithms: RanSOM-E for unconstrained optimization (using exponentially distributed steps) and RanSOM-B for constrained optimization (using beta-distributed steps to strictly preserve feasibility). Theoretical analysis confirms that RanSOM recovers the optimal $\mathcal{O}(\epsilon^{-3})$ convergence rate under standard bounded noise, and achieves optimal rates for heavy-tailed noise settings ($p \in (1, 2]$) without requiring gradient clipping.
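The unbiasedness claim for exponentially distributed steps can be checked numerically. The sketch below assumes the classical integration-by-parts identity for an exponential variable $s$ with mean $\eta$, namely $\mathbb{E}[\nabla f(x + s d)] - \nabla f(x) = \eta\, \mathbb{E}[\nabla^2 f(x + s d)\, d]$, and estimates both sides by Monte Carlo. The toy objective $f(x) = \sum_i \log\cosh(x_i)$, the point `x`, the direction `d`, and the helper names are hypothetical choices for illustration. This is not the paper's algorithm, only a check of the identity it builds on.

```python
import math
import random

def grad(x):
    """Gradient of the toy objective f(x) = sum_i log(cosh(x_i))."""
    return [math.tanh(xi) for xi in x]

def hvp(x, d):
    """Hessian-vector product; here the Hessian is diag(1 - tanh(x_i)^2)."""
    return [(1.0 - math.tanh(xi) ** 2) * di for xi, di in zip(x, d)]

def stein_check(x, d, eta, n_samples=100_000, seed=0):
    """Monte Carlo estimates of both sides of
    E[grad f(x + s d)] - grad f(x) = eta * E[hess f(x + s d) d],  s ~ Exp(mean eta)."""
    rng = random.Random(seed)
    g0 = grad(x)
    lhs = [0.0] * len(x)
    rhs = [0.0] * len(x)
    for _ in range(n_samples):
        s = rng.expovariate(1.0 / eta)  # rate 1/eta gives mean eta
        pt = [xi + s * di for xi, di in zip(x, d)]
        g, h = grad(pt), hvp(pt, d)  # gradient and HVP at the same random point
        for i in range(len(x)):
            lhs[i] += (g[i] - g0[i]) / n_samples
            rhs[i] += eta * h[i] / n_samples
    return lhs, rhs

x = [0.3, -1.2, 2.0]
d = [-1.0, 0.5, 0.25]
lhs, rhs = stein_check(x, d, eta=0.1)
print("gradient drift:", lhs)
print("eta * HVP     :", rhs)
```

The two printed vectors should agree up to Monte Carlo error. The Hessian-vector product is evaluated at the same random point as the gradient, which is consistent with the abstract's statement that no auxiliary query is needed.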

Paper Structure

This paper contains 38 sections, 18 theorems, 74 equations, 4 figures, 3 tables, 2 algorithms.

Key Result

Lemma 3.1

Let $g: \mathbb{R} \to \mathbb{R}^d$ be a differentiable function with bounded derivative.
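Only the hypothesis of the lemma appears in this listing. As a hedged illustration of the kind of identity such a hypothesis supports, the classical integration-by-parts result for an exponentially distributed $s$ with mean $\eta$ is worked out below. This is the standard fact and may differ in form from the paper's exact statement. The choice $g(s) = \nabla f(x + s d)$ at the end is an assumption about how the identity would be applied.

$$
\begin{aligned}
\mathbb{E}\big[g'(s)\big]
&= \int_0^\infty g'(s)\, \tfrac{1}{\eta} e^{-s/\eta}\, ds \\
&= \Big[ g(s)\, \tfrac{1}{\eta} e^{-s/\eta} \Big]_0^\infty + \tfrac{1}{\eta} \int_0^\infty g(s)\, \tfrac{1}{\eta} e^{-s/\eta}\, ds \\
&= -\tfrac{1}{\eta}\, g(0) + \tfrac{1}{\eta}\, \mathbb{E}\big[g(s)\big],
\end{aligned}
\qquad\text{so}\qquad
\mathbb{E}\big[g(s)\big] - g(0) = \eta\, \mathbb{E}\big[g'(s)\big].
$$

The boundary term at infinity vanishes because a bounded derivative makes $g$ grow at most linearly, which the exponential tail dominates. Taking $g(s) = \nabla f(x + s d)$ gives $g'(s) = \nabla^2 f(x + s d)\, d$, so the expected change in the gradient along a random step equals $\eta$ times the expected Hessian-vector product at the new point.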

Figures (4)

  • Figure 1: Momentum Bias. The historic momentum $m_{t-1}$ approximates $\nabla f(x_{t-1})$, but the optimizer has moved to $x_t$. The deviation (orange vector) is the bias induced by the Hessian $\nabla^2 f$. To accelerate convergence, this bias must be corrected.
  • Figure 2: Training Loss (left) and Test Accuracy (right) on the Splice dataset with Welsch regularization. RanSOM-E (Muon) shows the fastest convergence and highest final accuracy.
  • Figure 3: Training Loss (left) and Test Accuracy (right) on MNIST1D. RanSOM-E variants consistently outperform first-order and classic second-order baselines.
  • Figure 4: Average RMSE on Nano MovieLens Matrix Completion. RanSOM-B (green) converges to a lower final error than SFW-Polyak and SFW-SOM.

Theorems & Definitions (27)

  • Lemma 3.1: Stein-Type Identities for Optimization
  • Lemma 4.5: Descent Inequality
  • Lemma 4.6: Momentum Error Bound
  • Theorem 4.7: Convergence of RanSOM-E
  • Lemma 4.8: Constrained Descent Inequality
  • Theorem 4.9: Convergence of RanSOM-B
  • Lemma 1.1: Stein-Type Identities
  • proof
  • Lemma 1.2: Bound on Centered Hessian Error
  • proof
  • ...and 17 more