Dynamic Estimation of Learning Rates Using a Non-Linear Autoregressive Model

Ramin Okhrati

TL;DR

This paper introduces Nlar, a class of adaptive optimizers built on a non-linear autoregressive view of gradient descent that jointly estimates per-dimension learning rates and momentum. The core update uses a clipped, noisy first-order signal $f(\partial L_t/\partial \theta_t(d))$, which keeps convergence stable even for non-convex losses, and the learning-rate estimators $\hat{\gamma}_{t+1}(d)$ come with a strong-consistency proof. Two momentum-augmented variants, Nlarcm and Nlarsm, adjust momentum dynamically via $\rho_t(d)$. They remain robust under large initial learning rates and converge rapidly in early epochs on MNIST and CIFAR10 models (including VGG11 on CIFAR10) and on the CartPole-v0 reinforcement learning task, often outperforming Adam and AdamHD. The paper also gives detailed empirical evidence and default parameter recommendations, and outlines extensions of Nlar through richer noise models, alternative clipping functions, and Kalman-filter-like interpretations.

Abstract

We introduce a new class of adaptive non-linear autoregressive (Nlar) models incorporating the concept of momentum, which dynamically estimate both the learning rates and momentum as the number of iterations increases. In our method, the growth of the gradients is controlled using a scaling (clipping) function, leading to stable convergence. Within this framework, we propose three distinct estimators for learning rates and provide theoretical proof of their convergence. We further demonstrate how these estimators underpin the development of effective Nlar optimizers. The performance of the proposed estimators and optimizers is rigorously evaluated through extensive experiments across several datasets and a reinforcement learning environment. The results highlight two key features of the Nlar optimizers: robust convergence despite variations in underlying parameters, including large initial learning rates, and strong adaptability with rapid convergence during the initial epochs.
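
To make the role of the scaling (clipping) function concrete, here is a minimal sketch of a per-dimension update of the form $\theta_{t+1}(d) = \theta_t(d) - \gamma(d)\, f(\partial L_t/\partial \theta_t(d))$. It is not the paper's algorithm: the choice of `np.clip` for $f$, the names `clip_signal` and `clipped_step`, the bound of 1.0, and the toy quadratic loss are all assumptions made for illustration, and the learning rates are held fixed here rather than estimated.

```python
import numpy as np

def clip_signal(grad, bound=1.0):
    """Hypothetical scaling function f: limits each gradient component to [-bound, bound]."""
    return np.clip(grad, -bound, bound)

def clipped_step(theta, grad, gamma, bound=1.0):
    """One update per dimension d: theta(d) <- theta(d) - gamma(d) * f(grad(d))."""
    return theta - gamma * clip_signal(grad, bound)

# Toy loss L(theta) = 0.5 * sum(theta**2), whose gradient is theta itself.
theta = np.array([10.0, -0.5])
gamma = np.array([0.5, 0.1])  # one (fixed, assumed) learning rate per dimension

for _ in range(50):
    theta = clipped_step(theta, grad=theta, gamma=gamma)

print(theta)  # both components end up close to the minimizer at 0
```

Because $f$ is bounded, no single step can move a coordinate by more than `gamma * bound`, however large the raw gradient is. This is one way to read the abstract's claim that controlling gradient growth supports stable convergence under large initial learning rates.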

Paper Structure

This paper contains 38 sections, 3 theorems, 37 equations, 20 figures, 3 algorithms.

Key Result

Theorem 4

Suppose that Assumptions assump:epsilon, assump:grad, and assump:kprime hold. Then the estimator $\hat{\gamma}_{t+1}(d)$ given by eq:ktheta converges to $\gamma(d)$ in eq:Nlar_main as $t$ approaches infinity, i.e., $\lim_{t\rightarrow\infty}\hat{\gamma}_{t+1}(d) = \gamma(d)$.
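
The kind of convergence stated in Theorem 4 can be illustrated numerically with a simulation in one dimension. The estimator in eq:ktheta is not reproduced on this page, so the sketch below swaps in an ordinary least-squares estimator for $\gamma$ under an assumed model $\theta_{t+1} = \theta_t - \gamma f(g_t) + \varepsilon_t$ with Gaussian noise, a clipped gradient signal, and a toy quadratic loss. The function name `simulate_and_estimate` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_and_estimate(gamma_true=0.3, steps=20000, noise_std=0.05,
                          bound=1.0, seed=0):
    """Simulate theta_{t+1} = theta_t - gamma * f(g_t) + eps_t and track a
    running least-squares estimate of gamma (an assumed stand-in estimator)."""
    rng = np.random.default_rng(seed)
    theta = 5.0
    num, den = 0.0, 0.0
    estimates = np.empty(steps)
    for t in range(steps):
        # Toy loss 0.5 * theta**2 has gradient theta; f clips it to [-bound, bound].
        signal = float(np.clip(theta, -bound, bound))
        theta_next = theta - gamma_true * signal + noise_std * rng.standard_normal()
        # Least squares for gamma: regress -(theta_next - theta) on the clipped signal.
        num += -(theta_next - theta) * signal
        den += signal * signal
        estimates[t] = num / den
        theta = theta_next
    return estimates

estimates = simulate_and_estimate()
print(estimates[-1])  # expected to lie near gamma_true = 0.3
```

In this toy setting the noise keeps perturbing $\theta_t$, so the accumulated squared signal in the denominator keeps growing and the running estimate settles around the true $\gamma$. The paper's result rests on its own assumptions and estimator, which this simulation does not verify.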

Figures (20)

  • Figure 1: Logistic regression model on MNIST data: Performance comparison of Nlarsm and Nlarcm versus Adam and AdamHD (with $\beta = 10^{-7}$) across varied learning rates.
  • Figure 2: MLP2h on the MNIST dataset: Performance comparison of Nlarsm and Nlarcm versus Adam and AdamHD (with $\beta = 10^{-7}$) across varied learning rates.
  • Figure 3: MLP7h on the CIFAR10 dataset: Performance comparison of Nlarsm and Nlarcm versus Adam and AdamHD (with $\beta = 10^{-7}$) across varied learning rates.
  • Figure 4: VGG11 on the CIFAR10 dataset: Performance comparison of Nlarsm and Nlarcm versus Adam and AdamHD (with $\beta = 10^{-7}$) across varied learning rates.
  • Figure 5: CartPole-v0: Performance comparison of Nlarsm and Nlarcm versus Adam and AdamHD (with $\beta = 10^{-4}$) across varied learning rates.
  • ...and 15 more figures

Theorems & Definitions (7)

  • Theorem 4
  • Theorem 5
  • Definition 6: Nlarcm
  • Proposition 7
  • Definition 8: Nlarsm
  • Remark 9
  • Remark 10