AlphaGrad: Non-Linear Gradient Normalization Optimizer
Soham Sane
TL;DR
AlphaGrad is a memory-efficient optimizer that applies layer-wise L2 gradient normalization, $\tilde{g}_t^L = g_t^L / (\|g_t^L\|_2 + \epsilon)$, followed by a smooth non-linear transformation, $g_t^{\prime L} = \tanh(\alpha^L \cdot \tilde{g}_t^L)$, which yields bounded updates and reduces per-parameter state. The authors provide both convex and non-convex convergence analyses. The resulting rates depend on the problem dimension $n$ and on the alignment factor $\gamma_{\min}$, defined through $\gamma_t = \tanh(\alpha) \frac{\|g_t\|_2}{\|g_t\|_2 + \epsilon}$: $O\left(\frac{\sqrt{n}}{\gamma_{\min}\sqrt{T}}\right)$ in the convex case and $O\left( \sqrt{\frac{L n (f(x_1) - f_{\inf})}{\tanh^2(\alpha) T}} \right)$ in the non-convex setting. Empirically, AlphaGrad's performance is highly context-dependent across RL tasks: it is unstable in off-policy DQN, improves stability in TD3 when $\alpha$ is tuned, and performs substantially better in on-policy PPO, which underscores the critical role of empirical $\alpha$ selection. The work presents AlphaGrad as a compelling alternative for memory-constrained training, particularly in on-policy regimes, and outlines avenues for further validation, adaptive scheduling, and broader benchmarking.
Abstract
We introduce AlphaGrad, a memory-efficient, conditionally stateless optimizer addressing the memory overhead and hyperparameter complexity of adaptive methods like Adam. AlphaGrad enforces scale invariance via tensor-wise L2 gradient normalization followed by a smooth hyperbolic tangent transformation, $g' = \tanh(\alpha \cdot \tilde{g})$, controlled by a single steepness parameter $\alpha$. Our contributions include: (1) the AlphaGrad algorithm formulation; (2) a formal non-convex convergence analysis guaranteeing stationarity; (3) extensive empirical evaluation on diverse RL benchmarks (DQN, TD3, PPO). Compared to Adam, AlphaGrad demonstrates a highly context-dependent performance profile. While exhibiting instability in off-policy DQN, it provides enhanced training stability with competitive results in TD3 (requiring careful $\alpha$ tuning) and achieves substantially superior performance in on-policy PPO. These results underscore the critical importance of empirical $\alpha$ selection, revealing strong interactions between the optimizer's dynamics and the underlying RL algorithm. AlphaGrad presents a compelling alternative optimizer for memory-constrained scenarios and shows significant promise for on-policy learning regimes where its stability and efficiency advantages can be particularly impactful.
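To make the normalize-then-squash transformation concrete, here is a minimal pure-Python sketch of one stateless step. It is an illustration under stated assumptions, not the authors' implementation. The function names `alphagrad_transform` and `alphagrad_step`, the learning rate `lr`, the default `eps`, and the plain descent rule `p - lr * g'` are all hypothetical choices made for this example; only the normalization and the tanh transformation are taken from the formulas in the text.

```python
import math

def alphagrad_transform(grad, alpha, eps=1e-8):
    """Normalize one tensor's gradient by its L2 norm, then apply tanh.

    grad: flat list of floats for a single layer/tensor.
    alpha: steepness parameter for that layer.
    eps: small constant guarding against division by zero (assumed default).
    """
    norm = math.sqrt(sum(g * g for g in grad))
    return [math.tanh(alpha * g / (norm + eps)) for g in grad]

def alphagrad_step(params, grads, alphas, lr=0.01, eps=1e-8):
    """Apply one stateless update to a list of layers.

    The rule p <- p - lr * g' and the parameter lr are assumptions of
    this sketch; no momentum or other optimizer state is kept.
    """
    new_params = []
    for layer, grad, alpha in zip(params, grads, alphas):
        update = alphagrad_transform(grad, alpha, eps)
        new_params.append([p - lr * u for p, u in zip(layer, update)])
    return new_params

# Two layers whose gradients differ in scale by a factor of 1000.
params = [[1.0, -2.0], [0.5, 0.5, 0.5]]
grads = [[3.0, 4.0], [3000.0, 0.0, -4000.0]]
updated = alphagrad_step(params, grads, alphas=[1.0, 1.0], lr=0.1)
```

Because each gradient is divided by its own norm before the tanh, rescaling a layer's gradient leaves its update essentially unchanged (up to the effect of `eps`), and every component of the update has magnitude below 1. The only per-layer quantity the sketch needs besides the gradient is the scalar `alpha`, which is consistent with the reduced optimizer state described above.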
