Random Function Descent

Felix Benning; Leif Döring

TL;DR

The paper reframes optimization from a worst-case convex paradigm to a random-function framework. It introduces Random Function Descent (RFD) via a stochastic Taylor approximation and argues for scalability to high dimensions through an $\mathcal{O}(nd)$ cost for $n$ steps in $d$ dimensions, together with scale invariance. RFD updates are derived as $W_{n+1} = \Phi_{\mathbb{P}_{\mathbf{J}}}(W_n, \mathbf{J}(W_n), \nabla\mathbf{J}(W_n))$, which for isotropic Gaussian priors reduces to a step in the gradient direction with a data-dependent step size $\eta^*(\Theta)$; this connects RFD to gradient descent while incorporating a posterior-based trust term. Key contributions include a theoretical viability result under common covariance models, a practical covariance-estimation pipeline for mini-batch losses, and an MNIST case study showing competitive performance with minimal tuning; the work also clarifies how gradient clipping and warmup heuristics emerge from the RFD framework. Limitations center on the Gaussian and isotropy assumptions and on simple covariance models, with proposed extensions to non-stationary isotropy, geometric anisotropy, BLUE-based estimation, and online covariance updates to broaden applicability.
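To make the data-dependent step size concrete, here is a minimal Python sketch. It is not the authors' code. It assumes a squared exponential covariance of the form $\sigma^2 \exp(-\|x-y\|^2/(2s^2))$ and a constant prior mean $\mu$, and it assumes that the RFD step minimizes the conditional expectation of the cost along the negative gradient direction. The function names (`stochastic_taylor_sqexp`, `rfd_step_size_sqexp`, `brute_force_step_size`) are hypothetical. Under these assumptions the minimizer has a closed form that depends on the cost and gradient only through $\Theta = \|\nabla\mathbf{J}(w)\| / (\mu - \mathbf{J}(w))$, and the sketch checks it against a brute-force grid search.

```python
import math


def stochastic_taylor_sqexp(eta, cost, grad_norm, mean, scale):
    """Conditional mean of the cost after a step of length eta along -gradient.

    Assumes a squared exponential covariance with length scale `scale` and a
    constant prior mean `mean` (illustrative assumptions, see the lead-in).
    """
    decay = math.exp(-eta**2 / (2.0 * scale**2))
    return mean + decay * ((cost - mean) - eta * grad_norm)


def rfd_step_size_sqexp(cost, grad_norm, mean, scale):
    """Closed-form minimizer of the surrogate above over the step length eta."""
    if grad_norm == 0.0:
        return 0.0  # no gradient information, so no step
    a = (mean - cost) / (2.0 * grad_norm)  # equals 1 / (2 * Theta)
    return scale**2 / (math.sqrt(a * a + scale**2) + a)


def brute_force_step_size(cost, grad_norm, mean, scale, grid=30001):
    """Grid search over [0, 3 * scale], used only as a sanity check."""
    etas = [3.0 * scale * i / (grid - 1) for i in range(grid)]
    return min(
        etas,
        key=lambda eta: stochastic_taylor_sqexp(eta, cost, grad_norm, mean, scale),
    )


if __name__ == "__main__":
    args = dict(cost=0.5, grad_norm=2.0, mean=1.0, scale=1.0)
    print("closed form:", rfd_step_size_sqexp(**args))
    print("grid search:", brute_force_step_size(**args))
```

In this sketch the step length never exceeds the length scale $s$ when $\mathbf{J}(w) < \mu$, which resembles gradient clipping, and for small $\Theta$ it behaves like $s^2\|\nabla\mathbf{J}(w)\|/(\mu - \mathbf{J}(w))$, that is, gradient descent with learning rate $s^2/(\mu - \mathbf{J}(w))$. Multiplying the cost, its gradient and the prior mean by the same constant leaves $\Theta$, and hence the step, unchanged.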

Abstract

Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical 'convex function' framework with a 'random function' framework. With complexity $\mathcal{O}(n^3d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has not been viable in large dimension so far. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization we establish viability. Specifically, we use a 'stochastic Taylor approximation' to rediscover gradient descent, which is scalable in high dimension due to $\mathcal{O}(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantage of this random function framework is that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.

Paper Structure (44 sections, 33 theorems, 228 equations, 7 figures, 1 table)

Introduction
The random function descent algorithm
A distribution over cost functions
Relation to gradient descent
The RFD step size schedule
Asymptotic learning rate
RFD step sizes explain common step size heuristics
Mini-batch loss and covariance estimation
Variance estimation
Stochastic RFD (S-RFD)
MNIST case study
Limitations and extensions
Conclusion
Experiments
Covariance estimation
...and 29 more sections

Key Result

Lemma 4.0

For $\mathbf{J}\sim\mathcal{N}(\mu, C)$, the first order stochastic Taylor approximation is given by
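A hedged sketch of the form such a statement takes, not the paper's verbatim display: assume $\mathcal{N}(\mu, C)$ denotes an isotropic Gaussian random function with constant mean $\mu$ and covariance $\mathrm{Cov}(\mathbf{J}(x), \mathbf{J}(y)) = C(\|x-y\|^2/2)$. This parameterization and the step convention $w - d$ are assumptions made here. Differentiating the covariance and applying the standard Gaussian conditioning formula then gives:

```latex
\begin{aligned}
\mathrm{Cov}\big(\mathbf{J}(w), \nabla\mathbf{J}(w)\big) &= 0, \qquad
\mathrm{Cov}\big(\nabla\mathbf{J}(w), \nabla\mathbf{J}(w)\big) = -C'(0)\,\mathbb{I}, \\
\mathrm{Cov}\big(\mathbf{J}(w-d), \nabla\mathbf{J}(w)\big) &= C'\!\big(\tfrac{\|d\|^2}{2}\big)\, d, \\
\mathbb{E}\big[\mathbf{J}(w-d) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)\big]
&= \mu + \frac{C\big(\tfrac{\|d\|^2}{2}\big)}{C(0)}\,\big(\mathbf{J}(w)-\mu\big)
 - \frac{C'\big(\tfrac{\|d\|^2}{2}\big)}{C'(0)}\,\langle d, \nabla\mathbf{J}(w)\rangle .
\end{aligned}
```

For small $\|d\|$ both ratios approach $1$, so the expression reduces to the classical first order Taylor approximation $\mathbf{J}(w) - \langle d, \nabla\mathbf{J}(w)\rangle$. For large $\|d\|$, if $C$ and $C'$ decay, it reverts to the prior mean $\mu$.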

Figures (7)

Figure 1: The stochastic Taylor approximation naturally contains a trust bound, in contrast to the classical one. Here $\mathbf{J}$ is a Gaussian random function with the squared exponential covariance model, length scale $s=2$ and variance $\sigma^2=1$. The ribbon represents two conditional standard deviations around the conditional expectation.
Figure 2: RFD step sizes as a function of $\Theta=\frac{\|\nabla\mathbf{J}(w)\|}{\mu - \mathbf{J}(w)}$ assuming scale $s=1$ (cf. the table of optimal step sizes). A-RFD (Definition 5.1) is plotted as dashed lines. A-RFD of the rational quadratic coincides with A-RFD of the squared exponential covariance.
Figure 3: Training on the MNIST dataset (batch size $1024$). Ribbons describe the range between the $10\%$ and $90\%$ quantile of $20$ repeated experiments while lines represent their mean. SE stands for the squared exponential and RQ for the rational quadratic covariance. The validation loss uses the test data set, which provides a small advantage to Adam and SGD, as we also use it for tuning.
Figure 4: Visualization of the variance estimation (see the section on non-parametric covariance estimation) with $95\%$-confidence intervals based on the assumed distribution. Quantile-quantile (QQ) plots of the losses (against a normal distribution), squared losses (against a $\chi^2(1)$ distribution) and squared gradient norms (against a $\chi^2(d)$-distribution) are displayed on the right for a selection of batch sizes.
Figure 5: $20$ repeated covariance estimations of model M7 [anEnsembleSimpleConvolutional2020] applied to the MNIST dataset. On the left are the resulting asymptotic learning rates (assuming a final loss of zero) and on the right are the samples used until the stopping criterion interrupted sampling.
...and 2 more figures
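The trust bound described in the caption of Figure 1 can be illustrated with a short Python sketch. It is an illustration under assumptions, not the code behind the figure: it works in one dimension, assumes the squared exponential covariance has the form $\sigma^2 \exp(-(x-y)^2/(2s^2))$ with the caption's $s=2$ and $\sigma^2=1$, and assumes a prior mean of zero. The function names are hypothetical. It conditions $\mathbf{J}(w+d)$ on the observed value and slope at $w$ and returns the conditional mean together with a ribbon of two conditional standard deviations.

```python
import numpy as np


def sqexp_kernel(delta, scale=2.0, variance=1.0):
    """Assumed squared exponential covariance as a function of x - y."""
    return variance * np.exp(-delta**2 / (2.0 * scale**2))


def condition_on_value_and_slope(d, cost, slope, mean=0.0, scale=2.0, variance=1.0):
    """Conditional mean and variance of J(w + d) given J(w) = cost, J'(w) = slope."""
    k_d = sqexp_kernel(d, scale, variance)
    # Cross-covariances of J(w + d) with the observations (J(w), J'(w)).
    cross = np.array([k_d, k_d * d / scale**2])
    # Value and slope at the same point are uncorrelated for a stationary kernel.
    obs_cov = np.diag([variance, variance / scale**2])
    weights = np.linalg.solve(obs_cov, cross)
    post_mean = mean + weights @ np.array([cost - mean, slope])
    post_var = variance - cross @ weights
    return float(post_mean), float(max(post_var, 0.0))


def trust_ribbon(d, cost, slope, **kwargs):
    """Conditional mean plus/minus two conditional standard deviations."""
    post_mean, post_var = condition_on_value_and_slope(d, cost, slope, **kwargs)
    half_width = 2.0 * np.sqrt(post_var)
    return post_mean - half_width, post_mean + half_width


if __name__ == "__main__":
    for d in (0.0, 0.5, 2.0, 8.0):
        print(d, condition_on_value_and_slope(d, cost=0.3, slope=-0.5))
```

At $d=0$ the conditional variance is zero and the mean equals the observed value. Near $w$ the mean follows the tangent line. Far from $w$ the mean reverts to the prior mean and the variance to $\sigma^2$, which is why the approximation carries its own trust region.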

Theorems & Definitions (78)

Definition 2.1: Stochastic Taylor approximation
Definition 2.2: Random Function Descent -- RFD
Definition 3.1: Isotropy
Lemma 4.0: Explicit first order stochastic Taylor approximation
Theorem 4.0: Explicit RFD
Remark 4.1: Scalable complexity
Remark 4.2: Step until the given information is no longer informative
Definition 5.1: A-RFD
Proposition 5.2: A-RFD is well defined
Corollary 5.2
...and 68 more
