Gradient Descent as Loss Landscape Navigation: a Normative Framework for Deriving Learning Rules

John J. Vastola, Samuel J. Gershman, Kanaka Rajan

TL;DR

This work introduces a normative, continuous-time framework that treats learning rules as optimal-control policies navigating partially observable loss landscapes. By varying planning horizon, ambient geometry, and belief updating, it unifies gradient descent, momentum, natural gradient descent, Adam, and continual-learning strategies as special cases of a single objective. The approach clarifies how geometry and observability shape learning dynamics, and provides principled grounds for deriving or comparing learning rules beyond empirical tuning. It also connects these ideas to physics and biology, suggesting broader implications for designing adaptive algorithms and interpreting brain-inspired plasticity under realistic constraints.

Abstract

Learning rules -- prescriptions for updating model parameters to improve performance -- are typically assumed rather than derived. Why do some learning rules work better than others, and under what assumptions can a given rule be considered optimal? We propose a theoretical framework that casts learning rules as policies for navigating (partially observable) loss landscapes, and identifies optimal rules as solutions to an associated optimal control problem. A range of well-known rules emerge naturally within this framework under different assumptions: gradient descent from short-horizon optimization, momentum from longer-horizon planning, natural gradients from accounting for parameter space geometry, non-gradient rules from partial controllability, and adaptive optimizers like Adam from online Bayesian inference of loss landscape shape. We further show that continual learning strategies like weight resetting can be understood as optimal responses to task uncertainty. By unifying these phenomena under a single objective, our framework clarifies the computational structure of learning and offers a principled foundation for designing adaptive algorithms.

Paper Structure

This paper contains 63 sections, 164 equations, 3 figures.

Figures (3)

  • Figure 1: Basic idea of our framework and a simple example. a. Single-step approaches (top) optimize over short-term changes to the loss, while a multi-step approach (bottom) optimizes over longer-term changes. b. Gradient descent vs multi-step optimization for a double-well loss, with the optimal multi-step trajectory computed by directly minimizing the objective (see Appendix \ref{app:exp} for details). Note that the multi-step rule converges to the global rather than local minimum. c. Values of the kinetic (top) and potential (bottom) terms along the optimal trajectory from (b). The loss/potential does not decrease monotonically, since the learner must first escape a local minimum.
  • Figure 2: Effect of modulating temporal discounting rate. a. Example optimal $\theta_t$ traces for a 1D quadratic loss, assuming different values of the temporal discounting rate ($\gamma = 0, 1, 10$). b. Loss over time given the $\theta_t$ from (a), same values of $\gamma$. Note that lower values of $\gamma$ ('longer' planning horizon) produce loss curves that converge more quickly. c. Shape of trajectories for different $\gamma$ given a 2D anisotropic loss. In the gradient-descent-like regime ($\gamma \gg 1$), $\theta_2$ converges much more quickly than $\theta_1$ due to the anisotropy. In the ballistic ($\gamma \approx 0$) regime, the difference in convergence rates is not as extreme. d. Ratio of convergence rates $r_i := \sqrt{\gamma^2/4 + \eta k H_{ii}} - \gamma/2$ assuming a diagonal Hessian. In the gradient-descent-like regime, directions with four times as much curvature converge $4$ times faster; in the ballistic regime, they only converge $\sqrt{4} = 2$ times faster. e. An Adam-like implementation of the ballistic ($\gamma \approx 0$) rule was used to train a small multilayer perceptron (MLP) to classify MNIST digits. Left: loss over training; right: test set accuracy over training. f. Same as in (e), but for a small convolutional neural network (CNN) trained to classify CIFAR-10 images. The ballistic rule generally performs better than SGD (black), and similarly to or worse than Adam (red).
  • Figure 3: Parameter space geometry affects optimal learning trajectories. a. Optimal trajectory through $\theta_1$-$\theta_2$ space for an isotropic quadratic loss, assuming a trivial metric $\boldsymbol{G}$ and $\boldsymbol{f} \equiv \boldsymbol{0}$. The heatmap and contours show the value of the loss at each $(\theta_1, \theta_2)$ value. Black line: optimal trajectory; red dot: global minimum of loss. Note that, because the loss is isotropic, the optimal trajectory is too. b. Same as (a), but given a strongly anisotropic constant metric $\boldsymbol{G}$. Note that the optimal trajectory no longer behaves the same along each direction, and instead converges much more quickly along the $\theta_1$ direction. c. Same as (a), but given $\boldsymbol{f}$ that corresponds to purely rotational dynamics. Note two differences: the trajectory spirals about the origin, and it no longer converges to the global minimum of the loss, but to a different point closer to the origin (orange dot). d. Same as (a), but given $\boldsymbol{f}$ that corresponds to weight decay. There is no anisotropy, but the trajectory does not converge to the minimum of the loss.
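
The 4x versus 2x claim in Figure 2d can be checked numerically from the stated formula $r_i = \sqrt{\gamma^2/4 + \eta k H_{ii}} - \gamma/2$. The sketch below is only an illustration, not the authors' code. The function names, the choice to treat $\eta k$ as a single number (`eta_k`), and the particular values of $\gamma$ are all assumptions made here. The rate is evaluated in an algebraically equivalent form that avoids subtracting two nearly equal numbers when $\gamma$ is large.

```python
import math


def convergence_rate(gamma: float, eta_k: float, h_ii: float) -> float:
    """Evaluate r_i = sqrt(gamma^2/4 + eta_k*h_ii) - gamma/2.

    Uses the equivalent form x / (sqrt(gamma^2/4 + x) + gamma/2),
    with x = eta_k * h_ii, which is numerically stable for large gamma.
    `eta_k` stands for the product eta*k (an assumption of this sketch).
    """
    x = eta_k * h_ii
    return x / (math.sqrt(gamma ** 2 / 4.0 + x) + gamma / 2.0)


def rate_ratio(gamma: float, eta_k: float = 1.0,
               h_low: float = 1.0, h_high: float = 4.0) -> float:
    """Ratio of convergence rates between a high- and low-curvature direction."""
    return (convergence_rate(gamma, eta_k, h_high)
            / convergence_rate(gamma, eta_k, h_low))


if __name__ == "__main__":
    # Hypothetical gamma values spanning the ballistic and GD-like regimes.
    for gamma in (0.0, 1.0, 10.0, 1000.0):
        print(f"gamma = {gamma:7.1f}  ratio = {rate_ratio(gamma):.4f}")
```

With a curvature ratio of 4, the printed ratio starts at 2 for $\gamma = 0$ (ballistic regime) and approaches 4 as $\gamma$ grows (gradient-descent-like regime), since $r_i \approx \eta k H_{ii} / \gamma$ when $\gamma \gg 1$.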