Dynamic Estimation of Learning Rates Using a Non-Linear Autoregressive Model
Ramin Okhrati
TL;DR
This paper introduces a new class of adaptive optimizers, Nlar, based on a non-linear autoregressive view of gradient descent that jointly estimates per-dimension learning rates and momentum. The core update uses a clipped, noisy first-order signal $f(\partial L_t/\partial \theta_t(d))$, which enables stable convergence even for non-convex losses, and the learning-rate estimators $\hat{\gamma}_{t+1}(d)$ are proven to be strongly consistent. Two momentum-augmented variants, Nlarcm and Nlarsm, adjust momentum dynamically via $\rho_t(d)$. They remain robust under large initial learning rates and converge rapidly in early epochs in experiments involving MNIST, CIFAR10, VGG11, and the CartPole-v0 reinforcement learning environment, often outperforming Adam and AdamHD. The work provides detailed empirical evidence, recommended default parameters, and a roadmap for extending Nlar through richer noise models, alternative clipping functions, and Kalman-filter-like interpretations, highlighting its potential applicability in deep learning and reinforcement learning.
Abstract
We introduce a new class of adaptive non-linear autoregressive (Nlar) models incorporating the concept of momentum, which dynamically estimate both the learning rates and the momentum as the number of iterations increases. In our method, the growth of the gradients is controlled using a scaling (clipping) function, leading to stable convergence. Within this framework, we propose three distinct estimators for learning rates and prove their convergence. We further demonstrate how these estimators underpin the development of effective Nlar optimizers. The performance of the proposed estimators and optimizers is evaluated through extensive experiments across several datasets and a reinforcement learning environment. The results highlight two key features of the Nlar optimizers: robust convergence despite variations in the underlying parameters, including large initial learning rates, and strong adaptability with rapid convergence during the initial epochs.
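To illustrate why controlling gradient growth with a scaling (clipping) function can keep an update stable under a large learning rate, the following is a minimal sketch and not the Nlar estimator itself. The elementwise clip used for $f$, the bound of 1, the toy quadratic loss, and the fixed per-dimension rates `gamma` are all assumptions made for illustration; the paper's estimators update the learning rates dynamically, which this sketch does not attempt.

```python
import numpy as np

def clip_signal(grad, bound=1.0):
    """Hypothetical scaling function f: elementwise clipping to [-bound, bound]."""
    return np.clip(grad, -bound, bound)

def run_descent(theta0, curvature, gamma, steps=50, clipped=True, bound=1.0):
    """Gradient descent on the toy loss L(theta) = 0.5 * sum(curvature * theta**2).

    gamma holds fixed per-dimension learning rates (an assumption; Nlar
    estimates them dynamically). Returns the final parameter vector.
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        grad = curvature * theta              # exact gradient of the toy loss
        signal = clip_signal(grad, bound) if clipped else grad
        theta = theta - gamma * signal        # per-dimension update
    return theta

curvature = np.array([1.0, 100.0])   # second dimension is much steeper
gamma = np.array([0.5, 0.5])         # far too large for the steep dimension
theta0 = [5.0, 5.0]

theta_clipped = run_descent(theta0, curvature, gamma, clipped=True)
theta_plain = run_descent(theta0, curvature, gamma, clipped=False)

print("clipped:  ", theta_clipped)   # stays near the minimizer at the origin
print("unclipped:", theta_plain)     # steep dimension blows up
```

In this toy setting each clipped step moves a coordinate by at most `gamma * bound`, so the steep dimension cannot overshoot explosively, whereas the unclipped update multiplies that coordinate by $1 - 0.5 \cdot 100 = -49$ at every step and diverges.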
