Gradient is All You Need? How Consensus-Based Optimization can be Interpreted as a Stochastic Relaxation of Gradient Descent

Konstantin Riedl; Timo Klock; Carina Geldhauser; Massimo Fornasier

TL;DR

This paper interprets consensus-based optimization (CBO), a recently proposed multi-particle derivative-free optimization method, as a stochastic relaxation of gradient descent, and observes that through communication of the particles, CBO exhibits a stochastic gradient descent (SGD)-like behavior.

Abstract

In this paper, we provide a novel analytical perspective on the theoretical understanding of gradient-based learning algorithms by interpreting consensus-based optimization (CBO), a recently proposed multi-particle derivative-free optimization method, as a stochastic relaxation of gradient descent. Remarkably, we observe that through communication of the particles, CBO exhibits a stochastic gradient descent (SGD)-like behavior despite solely relying on evaluations of the objective function. The fundamental value of such a link between CBO and SGD lies in the fact that CBO is provably globally convergent to global minimizers for ample classes of nonsmooth and nonconvex objective functions. Hence, on the one hand, we offer a novel explanation for the success of stochastic relaxations of gradient descent by furnishing useful and precise insights that explain how problem-tailored stochastic perturbations of gradient descent (like the ones induced by CBO) overcome energy barriers and reach deep levels of nonconvex functions. On the other hand, and contrary to the conventional wisdom according to which derivative-free methods ought to be inefficient or lack generalization abilities, our results unveil an intrinsic gradient descent nature of such heuristics. Instructive numerical illustrations support the provided theoretical insights.

Paper Structure (36 sections, 20 theorems, 171 equations, 5 figures)

Introduction
Contributions
Organization
Notation
Characterization of the class of objective functions
Consensus-based optimization and the main result
Consensus-based optimization converges globally
Consensus-based optimization is a stochastic relaxation of gradient descent
Proof of the main result, Theorem 3.1
Technical details connecting CBO with GD via the CH scheme
Technical details connecting CBO with GD via the CH scheme
Conclusions
Introductory facts
Notation
Convex analysis
...and 21 more sections

Key Result

Theorem 3.1

Let $\mathcal{E}\in\mathcal{C}^{1}(\mathbb{R}^d)$ be $L$-smooth (a function $f\in\mathcal{C}^{1}(\mathbb{R}^d)$ is $L$-smooth if $\left\|{\nabla f(x)-\nabla f(x')}\right\|_2 \leq L\left\|{x-x'}\right\|_2$ for all $x,x'\in \mathbb{R}^d$) and satisfy minimal assumptions (summarized in Assumption asm:ob … where $g_k$ is stochastic noise fulfilling for each $k=1,\dots,K$ with high probability the quantit…
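
To make the $L$-smoothness condition in the statement above concrete, here is a small worked example. The quadratic $f$ and the matrix $A$ are assumptions chosen for illustration and do not appear in the paper.

```latex
f(x) = \tfrac{1}{2}\, x^\top A x, \quad A \in \mathbb{R}^{d\times d} \text{ symmetric}
\quad\Longrightarrow\quad \nabla f(x) = A x,
\\[4pt]
\left\| \nabla f(x) - \nabla f(x') \right\|_2
  = \left\| A (x - x') \right\|_2
  \leq \left\| A \right\|_2 \left\| x - x' \right\|_2
  \quad \text{for all } x, x' \in \mathbb{R}^d .
```

Hence this $f$ is $L$-smooth with $L = \|A\|_2$, the largest absolute eigenvalue of $A$. The definition does not require convexity: $A$ may be indefinite.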

Figures (5)

Figure 1: An illustration of the intuition that the CBO scheme can be regarded as a stochastic derivative-free (zero-order) relaxation of GD. To find the global minimizer ${x^*}$ of the nonconvex objective function $\mathcal{E}$ depicted in (a), we run the CBO algorithm for $K=250$ iterations with parameters $\Delta t=0.01$, $\alpha = 100$, $\lambda = 1$ and $\sigma = 1.6$, and $N=200$ particles, initialized i.i.d. according to $\rho_0 = \mathcal{N}((8,8), 0.5\mathrm{Id})$. This experiment is performed $50$ times. For each run we depict in (b) the positions of the consensus points computed during the CBO algorithm, i.e., the iterates of the CBO scheme for $k=1,\dots,K$. The color of the individual points corresponds to time, i.e., iterates at the beginning of the scheme are plotted in blue, whereas later iterates are colored orange. We observe that, after starting close to the initial position, the trajectories of the consensus points follow the path of the valley leading to the global minimizer ${x^*}$, until it is reached. In particular, unlike GD (cf. Figure 3(b)), the CBO scheme has the capability of jumping over locally deeper passages. Such desirable behavior is observed also for the Langevin dynamics (see Figure 3(c)), which can be regarded as a stochastic (noisy) version of GD.
Figure 2: Quantitative numerical analysis of the approximation error between the trajectories of the CBO scheme and GD, i.e., the scaling of the stochastic noise $g_k$ in the statement of Theorem 3.1. In the setting of the Canyon function $\mathcal{E}$ from Figure 1(a) but without a local minimum in the valley, we measure the distance between the two trajectories and plot the resulting approximation error for different values of $\alpha$ ((a) and (b)), different values of $\lambda$ (different colors), $\sigma$ (horizontal axis), and $N$ (different line styles). The other parameters of the CBO scheme are $K=1000$ and $\Delta t=0.1$ with the remaining setting being as in Figure 1. The results validate the theoretical scalings on $\left\|{g_k}\right\|_2$ predicted by Theorem 3.1.
Figure 3: An illustrative comparison between the algorithms discussed in this work. While GD (obtained as an explicit Euler time discretization of $\frac{d}{dt}x(t) = -\nabla\mathcal{E}(x(t))$ with time step size $\Delta t=0.01$ and run for $K=10^4$ iterations) gets stuck in a local minimum along the valley of $\mathcal{E}$ (see (b)), the stochastic algorithms in (a) and (c) as well as Figure 1(b) have the capability of escaping local minima. In (a) we depict the positions of the consensus hopping (CH) scheme for $K=250$ iterations with parameters $\alpha = 100$ and $\widetilde{\sigma} = 0.6$, where we approximate the underlying measure $\mu_k$ at each step $k$ using $200$ samples. The ability of the CH scheme to escape local minima improves with larger $\widetilde{\sigma}$, see Figure 4 in the appendix. In (c) we depict the trajectory of the annealed Langevin dynamics with $\beta_t=0.02\log(t+1)$ (obtained as an Euler-Maruyama time discretization with time step size $\Delta t=0.001$ and run for $K=10^4$ iterations). The remaining setting is as in Figure 1; in particular, $50$ individual runs of the experiment are plotted in (a) and (c).
Figure 4: A visual comparison of the CH scheme for different sampling widths $\widetilde{\sigma}$. We depict the positions of the consensus hopping scheme for different values of $\widetilde{\sigma}$ ($0.4$ in (a), $0.6$ in (b) and $0.7$ in (c)) in the setting of Figure 3(a). While for small $\widetilde{\sigma}$ the numerical scheme gets stuck in a local minimum of the objective, the ability to escape such critical points improves with larger $\widetilde{\sigma}$. Notice that (b) coincides with Figure 3(a).
Figure 5: An additional numerical experiment illustrating the behavior of the CBO scheme (see (b)), the consensus hopping scheme (see (c)), GD (see (d)) and the overdamped Langevin dynamics (see (e)) in search of the global minimizer ${x^*}$ of the nonconvex objective function $\mathcal{E}$ depicted in (a). The experimental setting is the one of Figures 1 and 3 with the only difference of the particles being initialized around $(5,-1)$.
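
The experiments described in the captions of Figures 1 and 3 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper's Canyon function is not reproduced in this summary, so `objective` is a hypothetical Rastrigin-type stand-in; the isotropic noise term follows the standard CBO formulation and may differ from the variant used in the paper; starting GD at $(8,8)$ is likewise an assumption. Only the parameter values ($N$, $K$, $\Delta t$, $\alpha$, $\lambda$, $\sigma$, $\rho_0$) are taken from the captions.

```python
import numpy as np


def objective(x):
    """Hypothetical nonconvex test function (NOT the paper's Canyon function).

    Rastrigin-type landscape with global minimizer at the origin.
    Accepts an array of shape (..., d) and returns an array of shape (...).
    """
    return np.sum(x**2 + 2.5 * (1.0 - np.cos(2.0 * np.pi * x)), axis=-1)


def gradient(x):
    """Analytic gradient of `objective`; used only by plain GD."""
    return 2.0 * x + 5.0 * np.pi * np.sin(2.0 * np.pi * x)


def consensus_point(particles, alpha):
    """Average of the particles with Gibbs weights exp(-alpha * E(x_i))."""
    energies = objective(particles)
    # Shift by the minimum energy for numerical stability; the common
    # factor cancels in the normalized average.
    weights = np.exp(-alpha * (energies - energies.min()))
    return weights @ particles / weights.sum()


def run_cbo(n_particles=200, n_steps=250, dt=0.01, alpha=100.0,
            lam=1.0, sigma=1.6, mean=(8.0, 8.0), var=0.5, seed=0):
    """Run CBO and return the consensus points, shape (n_steps, d)."""
    rng = np.random.default_rng(seed)
    d = len(mean)
    # Particles initialized i.i.d. from N(mean, var * Id).
    x = np.asarray(mean) + np.sqrt(var) * rng.standard_normal((n_particles, d))
    consensus = np.empty((n_steps, d))
    for k in range(n_steps):
        x_alpha = consensus_point(x, alpha)
        consensus[k] = x_alpha
        diff = x - x_alpha
        # Assumption: isotropic noise scaled by the distance to the consensus point.
        dist = np.linalg.norm(diff, axis=1, keepdims=True)
        noise = rng.standard_normal((n_particles, d))
        # Drift toward the consensus point plus scaled exploration noise.
        x = x - dt * lam * diff + sigma * np.sqrt(dt) * dist * noise
    return consensus


def run_gd(x0=(8.0, 8.0), n_steps=10_000, dt=0.01):
    """Explicit Euler discretization of x'(t) = -grad E(x(t)); returns the last iterate."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - dt * gradient(x)
    return x


if __name__ == "__main__":
    cbo_trajectory = run_cbo()
    gd_final = run_gd()
    print("last CBO consensus point:", cbo_trajectory[-1])
    print("last GD iterate:", gd_final)
```

Note that `run_cbo` only evaluates `objective`, whereas `run_gd` needs `gradient`; this is the derivative-free versus gradient-based distinction the captions draw. On the stand-in landscape, GD started at $(8,8)$ settles in a nearby local minimum, which mirrors the qualitative behavior described for Figure 3(b) without reproducing the paper's experiment.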

Theorems & Definitions (43)

Remark 2.1
Theorem 3.1: CBO is a stochastic relaxation of GD (main result)
Remark 3.2: Stochastic relaxations of GD
Theorem 4.1: CBO asymptotically convexifies nonconvex problems [fornasier2021convergence]
Theorem 4.2: Global CBO convergence [fornasier2021consensus]
Proof of Theorem 3.1
Theorem 5.1: CBO relaxes CH
Proof of the theorem "CBO relaxes CH"
Theorem 5.2: CBO relaxes CH
Proof of the theorem "CBO relaxes CH"
...and 33 more
