RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization

El Mahdi Chayti

TL;DR

RanSOM tackles curvature-induced bias in stochastic momentum for non-convex optimization by replacing fixed step sizes with randomized steps whose mean is $\eta_t$, and using Stein-type integration-by-parts to obtain an unbiased estimate of the curvature bias with a single Hessian-vector product. It instantiates two algorithms: RanSOM-E for unconstrained problems with exponential step distributions and RanSOM-B for constrained problems with beta distributions, both compatible with Linear Minimization Oracle updates. Theoretical results show optimal $\mathcal{O}(\epsilon^{-3})$ convergence under standard bounded-noise assumptions, and robustness to heavy-tailed gradient noise ($p,q\in(1,2]$) without requiring Lipschitz Hessians or auxiliary queries. Empirical results on Splice, MNIST1D, and Nano MovieLens demonstrate improved stability and performance relative to state-of-the-art baselines, validating RanSOM as a practical, assumption-light framework for non-convex and constrained optimization.

Abstract

Momentum methods, such as Polyak's Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in stochastic settings, limiting convergence to suboptimal $\mathcal{O}(\epsilon^{-4})$ rates. Existing corrections typically require expensive auxiliary sampling or restrictive smoothness assumptions. We propose RanSOM, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $\eta_t$. This modification allows us to leverage Stein-type identities to compute an exact, unbiased estimate of the momentum bias using a single Hessian-vector product computed jointly with the gradient, avoiding auxiliary queries. We instantiate this framework in two algorithms: RanSOM-E for unconstrained optimization (using exponentially distributed steps) and RanSOM-B for constrained optimization (using beta-distributed steps to strictly preserve feasibility). Theoretical analysis confirms that RanSOM recovers the optimal $\mathcal{O}(\epsilon^{-3})$ convergence rate under standard bounded noise, and achieves optimal rates for heavy-tailed noise settings ($p \in (1, 2]$) without requiring gradient clipping.
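To illustrate why a step distribution supported on $[0, 1]$ preserves feasibility, here is a minimal sketch of a convex-combination update $x \leftarrow (1 - s)\,x + s\,v$, where $v$ comes from a Linear Minimization Oracle over an $\ell_1$ ball and $s$ is beta distributed with mean $\eta$. This is not the paper's RanSOM-B algorithm: the parameterization $\mathrm{Beta}(1, 1/\eta - 1)$, the $\ell_1$ constraint set, the random directions standing in for a momentum estimate, and all function names are assumptions made for the example.

```python
import math
import random

def lmo_l1_ball(direction, radius=1.0):
    """Return argmin of <direction, v> over the l1 ball: a signed vertex."""
    i = max(range(len(direction)), key=lambda k: abs(direction[k]))
    v = [0.0] * len(direction)
    v[i] = -radius * math.copysign(1.0, direction[i])
    return v

def beta_step(x, v, eta, rng):
    """Convex-combination update with a random step s in [0, 1], E[s] = eta.

    Beta(1, 1/eta - 1) is an assumed parameterization (requires 0 < eta < 1).
    """
    s = rng.betavariate(1.0, 1.0 / eta - 1.0)
    return [(1.0 - s) * xi + s * vi for xi, vi in zip(x, v)], s

def run(num_steps=1000, dim=5, eta=0.1, radius=1.0, seed=0):
    rng = random.Random(seed)
    x = [0.0] * dim  # feasible starting point
    max_norm, step_sum = 0.0, 0.0
    for _ in range(num_steps):
        # Hypothetical stand-in for a momentum/gradient estimate.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = lmo_l1_ball(direction, radius)
        x, s = beta_step(x, v, eta, rng)
        step_sum += s
        max_norm = max(max_norm, sum(abs(xi) for xi in x))
    return max_norm, step_sum / num_steps

max_norm, mean_step = run()
print(f"largest l1 norm seen: {max_norm:.4f}, average step: {mean_step:.4f}")
```

Because every iterate is a convex combination of two feasible points, it never leaves the ball. An exponentially distributed step has unbounded support, so the same update could overshoot the vertex, which is consistent with the abstract reserving exponential steps for the unconstrained case.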

Paper Structure (38 sections, 18 theorems, 74 equations, 4 figures, 3 tables, 2 algorithms)

Introduction
The Bias-Variance Bottleneck
The Landscape of Bias Correction
Our Contribution: RanSOM
Related Work
Method: The RanSOM Framework
Randomized Integration by Parts
Joint Efficient Computation via Automatic Differentiation
Algorithm 1: RanSOM-E (Unconstrained / Normalized)
Algorithm 2: RanSOM-B (Constrained)
Theoretical Analysis
Preliminaries & Notation
Assumptions
Unconstrained Optimization (RanSOM-E)
Constrained Optimization (RanSOM-B)
...and 23 more sections

Key Result

Lemma 3.1

Let $g: \mathbb{R} \to \mathbb{R}^d$ be a differentiable function with bounded derivative.
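The lemma statement is cut off above, so its conclusion is not reproduced here. As a hedged illustration of the kind of identity involved, the standard integration-by-parts result for an exponentially distributed $s$ with mean $\eta$ reads

$$\mathbb{E}[g(s)] - g(0) = \eta\, \mathbb{E}[g'(s)].$$

With $g(s) = \nabla f(x + s d)$, the derivative $g'(s) = \nabla^2 f(x + s d)\, d$ is a Hessian-vector product, which is how a single such product can give an unbiased estimate of a gradient shift. The sketch below checks the identity by Monte Carlo for a hypothetical curve $g$; the function names and the choice of $g$ are assumptions, and the paper's exact lemma may be stated differently.

```python
import math
import random

def g(s):
    # Hypothetical smooth curve g: R -> R^2 with bounded derivative.
    return (math.sin(s), math.tanh(s))

def g_prime(s):
    # Exact derivative of g, computed by hand.
    return (math.cos(s), 1.0 - math.tanh(s) ** 2)

def check_exponential_identity(eta=0.3, n_samples=200_000, seed=0):
    """Estimate both sides of E[g(s)] - g(0) = eta * E[g'(s)], s ~ Exp(mean eta)."""
    rng = random.Random(seed)
    dim = 2
    mean_g = [0.0] * dim
    mean_dg = [0.0] * dim
    for _ in range(n_samples):
        s = rng.expovariate(1.0 / eta)  # rate 1/eta gives mean eta
        gs, dgs = g(s), g_prime(s)
        for i in range(dim):
            mean_g[i] += gs[i] / n_samples
            mean_dg[i] += dgs[i] / n_samples
    g0 = g(0.0)
    lhs = [mean_g[i] - g0[i] for i in range(dim)]
    rhs = [eta * mean_dg[i] for i in range(dim)]
    return lhs, rhs

lhs, rhs = check_exponential_identity()
print("E[g(s)] - g(0):", lhs)
print("eta * E[g'(s)]:", rhs)
```

The two printed vectors should agree up to Monte Carlo error. For the first coordinate the exact value is $\eta / (1 + \eta^2)$, which gives a second sanity check that does not depend on sampling both sides.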

Figures (4)

Figure 1: Momentum Bias. The historic momentum $m_{t-1}$ approximates $\nabla f(x_{t-1})$, but the optimizer has moved to $x_t$. The deviation (orange vector) is the bias induced by the Hessian $\nabla^2 f$. To accelerate convergence, this bias must be corrected.
Figure 2: Training Loss (left) and Test Accuracy (right) on the Splice dataset with Welsch regularization. RanSOM-E (Muon) shows the fastest convergence and highest final accuracy.
Figure 3: Training Loss (left) and Test Accuracy (right) on MNIST1D. RanSOM-E variants consistently outperform first-order and classic second-order baselines.
Figure 4: Average RMSE on Nano MovieLens Matrix Completion. RanSOM-B (green) converges to a lower final error than SFW-Polyak and SFW-SOM.

Theorems & Definitions (27)

Lemma 3.1: Stein-Type Identities for Optimization
Lemma 4.5: Descent Inequality
Lemma 4.6: Momentum Error Bound
Theorem 4.7: Convergence of RanSOM-E
Lemma 4.8: Constrained Descent Inequality
Theorem 4.9: Convergence of RanSOM-B
Lemma 1.1: Stein-Type Identities
proof
Lemma 1.2: Bound on Centered Hessian Error
proof
...and 17 more
