ANO : Faster is Better in Noisy Landscape

Adrien Kegreisz

TL;DR

Ano introduces a decoupled optimization paradigm that separates update direction (via the momentum sign) from update magnitude (via instantaneous gradient magnitude scaled by a second-moment term). This yields improved robustness to gradient noise and non-stationarity while maintaining first-order efficiency; Anolog extends this with a logarithmic momentum schedule to reduce tuning burden. Theoretical analysis provides a non-convex convergence guarantee of $\tilde{\mathcal{O}}(K^{-1/4})$, aligning with other sign-based methods, and empirical results demonstrate notable gains in noisy RL and NLP tasks, with competitive performance on standard benchmarks in CV. Overall, Ano offers a practical, robust alternative to momentum-based adaptive optimizers for noisy landscapes, with broad applicability across CV, NLP, and DRL.
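The decoupling described above can be illustrated with a short NumPy sketch. It is not the paper's reference implementation: the function name `ano_step`, the default hyperparameters, and in particular the plain exponential-moving-average rule for the second-moment term `v` are assumptions made for illustration, since this summary does not state the exact update equations. The only parts taken from the text are that the direction comes from the sign of the momentum and the magnitude comes from the instantaneous gradient, scaled by a second-moment term.

```python
import numpy as np


def ano_step(x, g, m, v, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One illustrative decoupled update (hypothetical helper, not the official API).

    x: parameters, g: current stochastic gradient,
    m: momentum buffer, v: second-moment buffer.
    """
    # Momentum smooths the gradient; only its sign is used for the direction.
    m = beta1 * m + (1.0 - beta1) * g
    # ASSUMPTION: plain EMA of squared gradients; the paper's rule may differ.
    v = beta2 * v + (1.0 - beta2) * g * g
    # Magnitude: instantaneous |g| scaled by the second-moment term.
    step = lr * np.abs(g) * np.sign(m) / (np.sqrt(v) + eps)
    return x - step, m, v


if __name__ == "__main__":
    # Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2.
    rng = np.random.default_rng(0)
    x = np.array([1.0, -2.0])
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for _ in range(1000):
        g = x + rng.normal(scale=0.5, size=x.shape)  # true gradient plus noise
        x, m, v = ano_step(x, g, m, v)
    print(x)
```

Because the direction is `sign(m)` rather than `sign(g)`, a single noisy gradient whose sign disagrees with the accumulated momentum changes the step size but does not flip the step direction.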

Abstract

Stochastic optimizers are central to deep learning, yet widely used methods such as Adam and Adan can degrade in non-stationary or noisy environments, partly due to their reliance on momentum-based magnitude estimates. We introduce Ano, a novel optimizer that decouples direction and magnitude: momentum is used for directional smoothing, while instantaneous gradient magnitudes determine step size. This design improves robustness to gradient noise while retaining the simplicity and efficiency of first-order methods. We further propose Anolog, which removes sensitivity to the momentum coefficient by expanding its window over time via a logarithmic schedule. We establish non-convex convergence guarantees with a convergence rate similar to other sign-based methods, and empirically show that Ano provides substantial gains in noisy and non-stationary regimes such as reinforcement learning, while remaining competitive on low-noise tasks.
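As a rough illustration of what a logarithmic momentum schedule could look like, the sketch below uses $\beta_{1,k} = 1 - 1/\log(k+2)$ for $k \ge 1$. This specific form and the function names are assumptions, since the abstract only says that the momentum window expands over time on a logarithmic schedule. The form is chosen because its effective averaging window $1/(1-\beta_{1,k})$ equals $\log(k+2)$, which grows logarithmically with the step count.

```python
import math


def anolog_beta1(k):
    """Hypothetical logarithmic momentum schedule: 1 - 1/log(k + 2), for k >= 1."""
    if k < 1:
        # For k = 0 the formula would give a negative coefficient.
        raise ValueError("k must be >= 1")
    return 1.0 - 1.0 / math.log(k + 2)


def effective_window(beta1):
    """Approximate number of past gradients averaged by an EMA with coefficient beta1."""
    return 1.0 / (1.0 - beta1)


if __name__ == "__main__":
    for k in (1, 10, 100, 10_000):
        b = anolog_beta1(k)
        print(k, round(b, 4), round(effective_window(b), 2))
```

Under this assumed schedule the coefficient starts small, so early steps follow recent gradients closely, and it approaches 1 slowly, so later steps average over a longer history without a hand-tuned momentum coefficient.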

Paper Structure

This paper contains 57 sections, 5 theorems, 85 equations, 7 figures, 11 tables, 2 algorithms.

Key Result

Lemma 1

Fix any coordinate $i \in [d]$ and assume $v_0 = 0$ and $\beta_2 \in [\tfrac{1}{2},1)$. Then for every $k \ge 0$,

Figures (7)

  • Figure 1: Training loss on CIFAR-100. Ano reduces loss faster and more stably than Adam.
  • Figure 2: Rewards over time for several MuJoCo environments, with baselines and 95% confidence intervals. The green curve corresponds to Ano (ours).
  • Figure 3: Hyperparameter robustness on a MuJoCo proxy (HalfCheetah with SAC). Adam on the left, Ano (ours) on the right.
  • Figure 4: Rewards over time for the Atari5 benchmark, with baselines and 95% confidence intervals. The green curve corresponds to Ano (ours).
  • Figure 5: Grid search on the CIFAR-10 proxy (ResNet-18) for the optimizers.
  • ...and 2 more figures

Theorems & Definitions (12)

  • Lemma 1: Bounds on $v_k$
  • proof
  • Lemma 2: Sign-Mismatch Probability for Ano
  • proof
  • Lemma 3: Lower bound on the expected update magnitude
  • proof
  • Lemma 4
  • proof
  • Theorem 1: Convergence to a stationary point
  • proof
  • ...and 2 more