Random Function Descent
Felix Benning, Leif Döring
TL;DR
The paper reframes optimization from a worst-case convex paradigm to a random-function framework. It introduces Random Function Descent (RFD) via a stochastic Taylor approximation and argues that the method scales to high dimensions because its cost is $O(nd)$ for $n$ steps in $d$ dimensions, while also being scale invariant. RFD updates are derived as $W_{n+1} = \Phi_{\mathbb{P}_{\mathbf{J}}}(W_n, \mathbf{J}(W_n), \nabla\mathbf{J}(W_n))$, which for isotropic Gaussian priors reduces to a step in the gradient direction with a data-dependent step size $\eta^*(\Theta)$; this connects RFD to gradient descent while incorporating a posterior-based trust term. Key contributions include a theoretical viability result under common covariance models, a practical covariance-estimation pipeline for mini-batch losses, and an MNIST case study showing competitive performance with minimal tuning; the work also clarifies how gradient clipping and warmup heuristics emerge from the RFD framework. Limitations center on the Gaussian and isotropy assumptions and on simple covariance models; proposed extensions to non-stationary isotropy, geometric anisotropy, BLUE-based estimation, and online covariance updates aim to broaden applicability.
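To make the reduction to a gradient-direction step concrete, the following is a sketch under assumptions that this summary does not spell out: $\mathbf{J}$ is a Gaussian random function with constant mean $\mu$ and isotropic covariance $\mathrm{Cov}(\mathbf{J}(x), \mathbf{J}(y)) = C(\|x-y\|^2/2)$. The symbols $\mu$, $C$, $w$, $d$ and $\eta$ are notation chosen here for illustration. Under these assumptions $\mathbf{J}(w)$ and $\nabla\mathbf{J}(w)$ are uncorrelated, and conditioning on both gives

$$
\mathbb{E}\big[\mathbf{J}(w+d) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)\big]
= \mu + \frac{C(\|d\|^2/2)}{C(0)}\,\big(\mathbf{J}(w)-\mu\big)
+ \frac{C'(\|d\|^2/2)}{C'(0)}\,\big\langle d, \nabla\mathbf{J}(w)\big\rangle .
$$

For a fixed step length $\|d\| = \eta$, only the inner product depends on the direction of $d$. Whenever the ratio $C'(\eta^2/2)/C'(0)$ is positive, the expression is minimized by $d = -\eta\,\nabla\mathbf{J}(w)/\|\nabla\mathbf{J}(w)\|$, which leaves a one-dimensional problem for the step length:

$$
\eta^* = \operatorname*{argmin}_{\eta \ge 0}\;
\frac{C(\eta^2/2)}{C(0)}\,\big(\mathbf{J}(w)-\mu\big)
- \eta\,\frac{C'(\eta^2/2)}{C'(0)}\,\|\nabla\mathbf{J}(w)\| .
$$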
Abstract
Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical 'convex function' framework with a 'random function' framework. With complexity $\mathcal{O}(n^3d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has not been viable in large dimension so far. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization we establish viability. Specifically, we use a 'stochastic Taylor approximation' to rediscover gradient descent, which is scalable in high dimension due to $\mathcal{O}(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantage of this random function framework is that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.
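The step size schedule and its scale invariance can be illustrated with a minimal Python sketch. It assumes a squared-exponential covariance $C(\|x-y\|^2/2) = \sigma^2 \exp(-\|x-y\|^2/(2s^2))$ with known mean $\mu$ and length scale $s$; this model choice and the names `rfd_step_size`, `rfd_step`, `mean` and `length_scale` are illustrative assumptions, not taken from the abstract. Under this assumption, minimizing the conditional expectation of the loss along the negative gradient direction has a closed-form solution for the step length.

```python
import math


def rfd_step_size(loss, grad_norm, mean, length_scale):
    """Step length eta* under an ASSUMED squared-exponential covariance.

    eta* is the positive root of eta^2 + 2*a*eta - s^2 = 0 with
    a = (mean - loss) / (2 * grad_norm) and s = length_scale.
    """
    if grad_norm <= 0.0:
        raise ValueError("grad_norm must be positive")
    s2 = length_scale ** 2
    a = (mean - loss) / (2.0 * grad_norm)
    root = math.sqrt(a * a + s2)
    # Two algebraically equal forms; pick the one without cancellation.
    return s2 / (root + a) if a >= 0.0 else root - a


def rfd_step(w, loss, grad, mean, length_scale):
    """One update: move a distance eta* along the negative gradient direction."""
    grad_norm = math.sqrt(sum(g * g for g in grad))
    eta = rfd_step_size(loss, grad_norm, mean, length_scale)
    return [wi - eta * gi / grad_norm for wi, gi in zip(w, grad)]


def expected_loss(eta, loss, grad_norm, mean, length_scale):
    """Conditional mean of the loss after a step of length eta (same assumed model)."""
    decay = math.exp(-eta * eta / (2.0 * length_scale ** 2))
    return mean + decay * ((loss - mean) - eta * grad_norm)


# Scale invariance: multiplying loss, mean and gradient by the same factor
# leaves the step length unchanged.
eta_a = rfd_step_size(loss=0.5, grad_norm=2.0, mean=1.0, length_scale=0.5)
eta_b = rfd_step_size(loss=5.0, grad_norm=20.0, mean=10.0, length_scale=0.5)
```

In this sketch, a large gradient norm drives `a` toward zero, so the step length approaches `length_scale` regardless of the gradient magnitude, which resembles gradient clipping. A small gradient norm with `loss` below `mean` gives a step of roughly `length_scale**2 / (mean - loss)` times the gradient, which resembles a constant learning rate.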
