Random Function Descent

Felix Benning, Leif Döring

TL;DR

The paper reframes optimization from the classical worst-case convex paradigm to a random-function framework. It introduces Random Function Descent (RFD) via a stochastic Taylor approximation and argues that the method scales to high dimensions: its complexity is $\mathcal{O}(nd)$ for $n$ steps in $d$ dimensions, and it is scale invariant. RFD updates are derived as $W_{n+1} = \Phi_{\mathbb{P}_{\mathbf{J}}}(W_n, \mathbf{J}(W_n), \nabla\mathbf{J}(W_n))$, which for isotropic Gaussian priors reduces to a step in the gradient direction with a data-dependent step size $\eta^*(\Theta)$; this connects RFD to gradient descent while incorporating a posterior-based trust term. Key contributions include a theoretical viability result under common covariance models, a practical covariance-estimation pipeline for mini-batch losses, and an MNIST case study showing competitive performance with minimal tuning; the work also clarifies how the gradient clipping and warmup heuristics emerge from the RFD framework. Limitations center on the Gaussian and isotropy assumptions and on simple covariance models; proposed extensions to non-stationary isotropy, geometric anisotropy, BLUE-based estimation, and online covariance updates aim to broaden applicability.

Abstract

Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical 'convex function' framework with a 'random function' framework. With complexity $\mathcal{O}(n^3d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has not been viable in large dimension so far. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization we establish viability. Specifically, we use a 'stochastic Taylor approximation' to rediscover gradient descent, which is scalable in high dimension due to $\mathcal{O}(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantage of this random function framework is that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.
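As a minimal sketch of what such a step size schedule can look like, the Python snippet below assumes a squared exponential covariance with length scale $s$ and a known prior mean $\mu$. Under that assumption, minimizing $\mu + e^{-\eta^2/(2s^2)}\bigl[(\mathbf{J}(w)-\mu) - \eta\|\nabla\mathbf{J}(w)\|\bigr]$ over the step length $\eta \ge 0$ along the normalized negative gradient gives the closed form used here. The function names `rfd_step_length` and `rfd_update` and the toy quadratic loss are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
import numpy as np


def rfd_step_length(grad_norm, loss_gap, s=1.0):
    """Step length along the normalized negative gradient.

    Assumes a squared exponential covariance with length scale `s`.
    `loss_gap` is mu - J(w), the prior mean minus the current loss.
    """
    b = loss_gap / (2.0 * grad_norm)
    # Positive root of the first-order optimality condition.
    return s**2 / (np.sqrt(b**2 + s**2) + b)


def rfd_update(w, loss, grad, mu, s=1.0):
    """One descent step W_{n+1} = W_n - eta * grad / ||grad||."""
    grad_norm = np.linalg.norm(grad)
    if grad_norm == 0.0:
        return w  # no descent direction available
    eta = rfd_step_length(grad_norm, mu - loss, s)
    return w - eta * grad / grad_norm


# Usage on a toy quadratic loss (hypothetical example, not from the paper).
w = np.array([3.0, -4.0])
loss = 0.5 * w @ w
grad = w
w_next = rfd_update(w, loss, grad, mu=20.0, s=1.0)
```

In this sketch the step length depends on the loss and gradient only through the ratio $\Theta = \|\nabla\mathbf{J}(w)\| / (\mu - \mathbf{J}(w))$, so rescaling the loss by a constant leaves the step unchanged. For large $\Theta$ the step length saturates at $s$, which resembles gradient clipping; for small $\Theta$ it behaves like $s^2\Theta$, i.e. a plain gradient step with learning rate $s^2/(\mu - \mathbf{J}(w))$.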

Paper Structure

This paper contains 44 sections, 33 theorems, 228 equations, 7 figures, 1 table.

Key Result

Lemma 4.0

For $\mathbf{J}\sim\mathcal{N}(\mu, C)$, the first order stochastic Taylor approximation is given by
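For orientation, here is a hedged sketch of the form such an approximation takes under standard Gaussian conditioning; it is not a quotation of the lemma. It assumes that $\mathbf{J}$ is an isotropic Gaussian random function with constant mean $\mu$ and covariance $\operatorname{Cov}(\mathbf{J}(x), \mathbf{J}(y)) = C(\|x-y\|^2/2)$. This parametrization, and the choice to write the step as $w - d$, are assumptions made here.

```latex
% Covariances needed for conditioning on J(w) and \nabla J(w):
\begin{align*}
\operatorname{Var}(\mathbf{J}(w)) &= C(0), &
\operatorname{Cov}(\mathbf{J}(w), \nabla\mathbf{J}(w)) &= 0, \\
\operatorname{Cov}(\nabla\mathbf{J}(w)) &= -C'(0)\, I, &
\operatorname{Cov}(\mathbf{J}(w-d), \mathbf{J}(w)) &= C\bigl(\tfrac{\|d\|^2}{2}\bigr), \\
\operatorname{Cov}(\mathbf{J}(w-d), \nabla\mathbf{J}(w)) &= C'\bigl(\tfrac{\|d\|^2}{2}\bigr)\, d.
\end{align*}
% Gaussian conditional expectation:
\begin{equation*}
\mathbb{E}\bigl[\mathbf{J}(w-d) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)\bigr]
= \mu
+ \frac{C\bigl(\tfrac{\|d\|^2}{2}\bigr)}{C(0)}\,\bigl(\mathbf{J}(w) - \mu\bigr)
- \frac{C'\bigl(\tfrac{\|d\|^2}{2}\bigr)}{C'(0)}\,\bigl\langle d, \nabla\mathbf{J}(w)\bigr\rangle.
\end{equation*}
```

For small $d$ both ratios are close to one, so the expression reduces to the classical first order Taylor approximation $\mathbf{J}(w) - \langle d, \nabla\mathbf{J}(w)\rangle$. For covariance models whose $C$ and $C'$ decay, such as the squared exponential, the approximation reverts to the prior mean $\mu$ far from $w$.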

Figures (7)

  • Figure 1: The stochastic Taylor approximation naturally contains a trust bound in contrast to the classical one. Here $\mathbf{J}$ is a Gaussian random function (with the squared exponential covariance model, length scale $s=2$ and variance $\sigma^2=1$). The ribbon represents two conditional standard deviations around the conditional expectation.
  • Figure 2: RFD step sizes as a function of $\Theta=\frac{\|\nabla\mathbf{J}(w)\|}{\mu - \mathbf{J}(w)}$ assuming scale $s=1$ (cf. the table of optimal step sizes). A-RFD (Definition 5.1) is plotted as dashed lines. A-RFD of the rational quadratic coincides with A-RFD of the squared exponential covariance.
  • Figure 3: Training on the MNIST dataset (batch size $1024$). Ribbons describe the range between the $10\%$ and $90\%$ quantile of $20$ repeated experiments while lines represent their mean. SE stands for the squared exponential and RQ for the rational quadratic covariance. The validation loss uses the test data set, which provides a small advantage to Adam and SGD, as we also use it for tuning.
  • Figure 4: Visualization of the variance estimation (see the section on non-parametric covariance estimation) with $95\%$-confidence intervals based on the assumed distribution. Quantile-quantile (QQ) plots of the losses (against a normal distribution), squared losses (against a $\chi^2(1)$ distribution) and squared gradient norms (against a $\chi^2(d)$-distribution) are displayed on the right for a selection of batch sizes.
  • Figure 5: $20$ repeated covariance estimations of model M7 (An et al., 2020) applied to the MNIST dataset. On the left are the resulting asymptotic learning rates (assuming a final loss of zero) and on the right are the samples used until the stopping criterion interrupted sampling.
  • ...and 2 more figures

Theorems & Definitions (78)

  • Definition 2.1: Stochastic Taylor approximation
  • Definition 2.2: Random Function Descent -- RFD
  • Definition 3.1: Isotropy
  • Lemma 4.0: Explicit first order stochastic Taylor approximation
  • Theorem 4.0: Explicit RFD
  • Remark 4.1: Scalable complexity
  • Remark 4.2: Step until the given information is no longer informative
  • Definition 5.1: A-RFD
  • Proposition 5.2: A-RFD is well defined
  • Corollary 5.2
  • ...and 68 more