Geometrical structures of digital fluctuations in parameter space of neural networks trained with adaptive momentum optimization
Igor V. Netay
TL;DR
The paper investigates numerical instability in neural network training with adaptive momentum (Adam) and identifies finite-precision arithmetic as a primary source of divergence. Using large-scale experiments (≈1600 networks, 50k epochs), it reveals geometrical structures in parameter space, namely double twisted spirals, that arise from the interaction of digital noise with relaxation oscillations in the first and second moment estimates. The author shows that these spirals correlate with the hyper-parameters: the fluctuation periods are close to $\frac{1}{1-\beta_2}$ and the fast components are close to $\frac{1}{1-\beta_1}$, which links the geometry to the optimizer settings. The paper argues that analysis of local dynamics can serve as a practical tool for predicting instability and guiding stability-aware training, and it highlights numerical inexactness as a fundamental limit.
Abstract
We present the results of numerical experiments on neural networks trained by stochastic gradient-based optimization with adaptive momentum. This widely used optimization method has proven convergence and is efficient in practice, but it becomes numerically unstable in long training runs. We show that numerical artifacts are not limited to large-scale models: they are also observable in shallow, narrow networks and eventually lead to divergence there as well. We support this claim with experiments on more than 1600 neural networks trained for 50000 epochs. Local observations show the same behavior of network parameters in both stable and unstable training segments. Geometrically, the parameters trace double twisted spirals in parameter space; this behavior is caused by numerical perturbations alternating with the relaxation oscillations that follow them in the values of the first and second moment estimates.
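
To make the quantities named above concrete, the following is a minimal sketch of one textbook Adam update together with the two time scales $\frac{1}{1-\beta_1}$ and $\frac{1}{1-\beta_2}$. It is an illustration only, not the code used for the experiments: the function names `adam_timescales` and `adam_step` are hypothetical, and the values $\beta_1 = 0.9$, $\beta_2 = 0.999$ are the commonly used Adam defaults, assumed here rather than taken from the paper.

```python
import numpy as np


def adam_timescales(beta1=0.9, beta2=0.999):
    """Return (fast, slow) averaging windows, in steps, of the two moment estimates.

    Assumption: beta1 and beta2 are the usual Adam defaults, not the paper's settings.
    """
    return 1.0 / (1.0 - beta1), 1.0 / (1.0 - beta2)


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update; t is the 1-based step index."""
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Parameter update scaled by the second-moment estimate.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


fast, slow = adam_timescales()
print(f"fast scale: about {fast:.0f} steps, slow scale: about {slow:.0f} steps")

# Single step from zero-initialized moments on a one-parameter toy problem.
theta, m, v = adam_step(np.array([1.0]), np.array([2.0]),
                        np.zeros(1), np.zeros(1), t=1)
```

With these assumed defaults the fast scale is about 10 steps and the slow scale about 1000 steps, so the two kinds of oscillation would be separated by roughly two orders of magnitude in time.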
