
Geometrical structures of digital fluctuations in parameter space of neural networks trained with adaptive momentum optimization

Igor V. Netay

TL;DR

The paper investigates numerical instability in neural network training with adaptive momentum (Adam) and identifies finite-precision arithmetic as a primary source of divergence. Using large-scale experiments (≈1600 networks, 50k epochs), it reveals geometrical structures in parameter space, double twisted spirals, that arise from the interaction of digital noise with relaxation oscillations in the first and second moment estimates. The author shows that these spirals correlate with the optimizer hyper-parameters: fluctuation periods are close to $\frac{1}{1-\beta_2}$ and fast components are close to $\frac{1}{1-\beta_1}$, which links the geometry to the optimizer settings. The paper argues that analysis of local dynamics can serve as a practical tool for predicting instability and guiding stability-aware training, and it highlights numerical inexactness as a fundamental limit.

Abstract

We present the results of numerical experiments on neural networks trained by stochastic gradient-based optimization with adaptive momentum. This widely used optimization method has proven convergence and is efficient in practice, but it becomes numerically unstable in long training runs. We show that numerical artifacts are observable not only in large-scale models: they eventually lead to divergence even in shallow, narrow networks. We support this claim with experiments on more than 1600 neural networks trained for 50000 epochs. Local observations show the same behavior of network parameters in both stable and unstable training segments. Geometrically, the parameters trace double twisted spirals in parameter space, caused by numerical perturbations alternating with subsequent relaxation oscillations in the values of the first and second moments.
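
To make the two time scales concrete, here is a minimal sketch of a scalar update in the standard Adam form, together with the characteristic windows $\frac{1}{1-\beta_1}$ and $\frac{1}{1-\beta_2}$ of its two moving averages. The function names (`ema_timescale`, `adam_step`) are hypothetical, and the hyper-parameter values are the commonly used Adam defaults, assumed here for illustration. The paper's own settings are not stated in this summary.

```python
import math


def ema_timescale(beta):
    """Characteristic window, in steps, of an exponential moving average
    with decay factor beta (hypothetical helper for illustration)."""
    return 1.0 / (1.0 - beta)


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update in its standard textbook form.

    The defaults are the commonly used values, assumed here; they are
    not taken from the paper. The step counter t starts at 1.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Time scales for the assumed defaults: roughly 10 and 1000 steps.
fast = ema_timescale(0.9)
slow = ema_timescale(0.999)
print(f"fast window ~ {fast:.0f} steps, slow window ~ {slow:.0f} steps")
```

With these assumed values the first moment averages over roughly 10 steps and the second over roughly 1000, which is the separation between fast components and slow fluctuation periods described in the TL;DR.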

Paper Structure (7 sections, 7 figures)

Figures (7)

  • Figure 1: Common behavior of some pair of parameters and loss for $(12, 24)$ (left) and $(16, 29)$ (right).
  • Figure 2: Loss behavior for $(17,12)$, $(20,20)$, $(26,26)$, $(27,27)$.
  • Figure 3: Behavior of some parameters for $(17,12)$, $(27,27)$, $(20,20)$, $(27,27)$.
  • Figure 4: Common behavior of some parameters for $(26,26)$, $(27,27)$.
  • Figure 5: Common behavior of some parameters for $(21, 21)$, $(17,12)$ (2 pictures).
  • ...and 2 more figures