HVAdam: A Full-Dimension Adaptive Optimizer

Yiheng Zhang, Shaowu Wu, Yuanzhuo Xu, Jiajun Wu, Shang Xu, Steve Drew, Xiaoguang Niu

TL;DR

The paper identifies adaptivity in pre-conditioners as a key factor limiting generalization for adaptive optimizers in valley-like landscapes. It proposes HVAdam, a full-dimension adaptive optimizer that uses a hidden vector to capture invariant gradient trends, plus a restart strategy and a noise-aware preconditioning scheme. The authors provide convergence guarantees for both convex and non-convex settings and demonstrate substantial empirical improvements across image classification, NLP, and GAN tasks. The results suggest adaptivity can be tuned to bridge classical SGD and Adam behavior, offering a unified framework that outperforms existing optimizers.

Abstract

Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD, on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
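For readers unfamiliar with the "hard max-tracking" that the abstract contrasts IDU with, the following is a minimal scalar sketch of the standard Adam and AMSGrad second-moment updates. It is background only: IDU is not specified in this summary, so it is not shown, and the helper names (`adam_second_moment`, `amsgrad_second_moment`, `trace`) and the toy gradient sequence are illustrative assumptions.

```python
def adam_second_moment(v_prev, grad, beta2=0.999):
    """Adam: exponential moving average of squared gradients."""
    return beta2 * v_prev + (1.0 - beta2) * grad * grad


def amsgrad_second_moment(v_prev, v_hat_prev, grad, beta2=0.999):
    """AMSGrad: same moving average, but the preconditioner uses its running max."""
    v = adam_second_moment(v_prev, grad, beta2)
    v_hat = max(v_hat_prev, v)  # hard max-tracking: v_hat never decreases
    return v, v_hat


def trace(grads, beta2=0.9):
    """Record both statistics over a (hypothetical) scalar gradient sequence."""
    v = v_hat = 0.0
    adam_trace, ams_trace = [], []
    for g in grads:
        v, v_hat = amsgrad_second_moment(v, v_hat, g, beta2)
        adam_trace.append(v)
        ams_trace.append(v_hat)
    return adam_trace, ams_trace


# One large gradient followed by small ones (toy data, not from the paper).
adam_trace, ams_trace = trace([10.0, 0.1, 0.1, 0.1])
```

After the single large gradient, Adam's moving average decays, whereas AMSGrad's tracked maximum stays pinned at its peak; this is the rigidity that a more flexible scheme would aim to relax.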

Paper Structure

This paper contains 30 sections, 7 theorems, 56 equations, 10 figures, 7 tables, 3 algorithms.

Key Result

Theorem B.1

Under the theorem's assumptions, the value of $v_t$ in the proposed algorithm has a convergence rate of $O(1/T)$.

Figures (10)

  • Figure 1: A typical example of the valley dilemma. (a) and (c) depict the trajectories of SGD, Adam, AdaBelief, and HVAdam in 2D and 3D plots. (b) is a close-up of the red box in (a), showing the slow convergence and zigzagging of Adam, AdaBelief, and SGD; HVAdam, in contrast, converges rapidly along the hidden vector direction.
  • Figure 1: Training accuracies of three models using different optimizers on CIFAR-10 and CIFAR-100. Top: training accuracies of different network models on CIFAR-10. Bottom: training accuracies of different network models on CIFAR-100.
  • Figure 2: Trajectories of SGD, Adam, AdaBelief, and HVAdam. The functions, from (a) to (d), are $f_1$, $f_2$, $f_3$, and $f_4$, as given in Sec. E of the supplementary material. HVAdam reaches the optimal point (marked as an orange cross in the 2D and 3D plots) the fastest in all cases.
  • Figure 2: Test and training perplexity on Penn Treebank for 1-, 2-, and 3-layer LSTMs, from left to right. Lower is better.
  • Figure 3: Consider $f(x,y)=4\vert x-y \vert + \vert x+y \vert$. Left: Optimization process for the example function. Our algorithm uses only two steps to get the hidden vector $v^*$. Right: how $v_t$ is made to approximate $v^*$ for this function.
  • ...and 5 more figures
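The valley function from the Figure 3 caption, $f(x,y)=4\vert x-y \vert + \vert x+y \vert$, is simple enough to probe by hand. The sketch below evaluates a subgradient at two points on opposite sides of the valley floor $x=y$. It is only an illustration of why gradients zigzag in such a landscape; it is not the paper's procedure for computing $v^*$, and the helper names (`sign`, `subgrad`) and sample points are assumptions made for this example.

```python
def f(x, y):
    """Valley function from the Figure 3 caption."""
    return 4.0 * abs(x - y) + abs(x + y)


def sign(z):
    """Return -1, 0, or 1 according to the sign of z."""
    return (z > 0) - (z < 0)


def subgrad(x, y):
    """A subgradient of f at (x, y); the gradient wherever f is differentiable."""
    s, t = sign(x - y), sign(x + y)
    return (4.0 * s + t, -4.0 * s + t)


# Two points just on either side of the valley floor x = y (hypothetical samples).
g_a = subgrad(1.1, 1.0)  # steep component points one way across the valley
g_b = subgrad(1.0, 1.1)  # steep component flips sign on the other side

# Averaging cancels the flipping component and leaves the shared one.
mean = ((g_a[0] + g_b[0]) / 2, (g_a[1] + g_b[1]) / 2)
```

Here `g_a` is `(5.0, -3.0)` and `g_b` is `(-3.0, 5.0)`: the component across the valley reverses, while the average `(1.0, 1.0)` points along the valley floor. A step against that shared direction moves toward the minimizer at the origin without zigzagging.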

Theorems & Definitions (8)

  • Theorem B.1
  • Lemma C.1
  • Theorem C.2
  • Corollary C.2.1
  • Theorem D.1
  • Lemma D.2
  • Theorem D.3
  • proof