Leveraging Memory Effects and Gradient Information in Consensus-Based Optimization: On Global Convergence in Mean-Field Law

Konstantin Riedl

TL;DR

The paper tackles global optimization in high dimensions for nonconvex, nonsmooth objectives by analyzing a memory-augmented consensus-based optimization (CBO) method that also incorporates gradient information. It develops a mean-field framework, deriving a nonlinear SDE and the corresponding Fokker-Planck equation, and proves exponential convergence of the law to the global minimizer $x^*$ of ${\cal E}$ via a Lyapunov functional ${\cal V}(\rho_t)$ with explicit rates. A quantitative Laplace principle bound and a positive lower bound on the probability mass near $x^*$ enable nonasymptotic convergence guarantees, while numerical experiments in machine learning and compressed sensing illustrate practical benefits of memory and gradient terms. Overall, the work provides rigorous global convergence theory for a flexible CBO variant and demonstrates its effectiveness across challenging high-dimensional tasks.

Abstract

In this paper we study consensus-based optimization (CBO), a versatile, flexible and customizable optimization method suitable for performing nonconvex and nonsmooth global optimizations in high dimensions. CBO is a multi-particle metaheuristic, which is effective in various applications and at the same time amenable to theoretical analysis thanks to its minimalistic design. The underlying dynamics, however, is flexible enough to incorporate different mechanisms widely used in evolutionary computation and machine learning, as we show by analyzing a variant of CBO which makes use of memory effects and gradient information. We rigorously prove that this dynamics converges to a global minimizer of the objective function in mean-field law for a vast class of functions under minimal assumptions on the initialization of the method. The proof in particular reveals how to leverage further, in some applications advantageous, forces in the dynamics without losing provable global convergence. To demonstrate the benefit of the herein investigated memory effects and gradient information in certain applications, we present numerical evidence for the superiority of this CBO variant in applications such as machine learning and compressed sensing, which en passant widen the scope of applications of CBO.

Paper Structure (19 sections, 10 theorems, 119 equations, 6 figures)

Introduction
Versatility and Flexibility of CBO --- A Literature Overview.
Contributions.
Organization
Notation
Global Convergence in Mean-Field Law
Definition and Existence of Weak Solutions
Main Result
Proof Details for Section \ref{subsec:main_results}
Evolution of the Mean-Field Limit
Quantitative Laplace Principle
A Lower Bound for the Probability Mass $\rho_{Y,t}(B^\infty_{r}(x^*))$
Proof of Theorem \ref{thm:global_convergence_main}
Numerical Experiments
Implementational Aspects
...and 4 more sections

Key Result

Theorem 2

Let $T > 0$, $\rho_0 \in {\cal P}_4(\mathbb{R}^d\times\mathbb{R}^d)$. Let ${\cal E} : \mathbb{R}^d\rightarrow \mathbb{R}$ with $\underbar{\cal E} > -\infty$ satisfy for some constants $C_1,C_2 > 0$ the conditions and either $\sup_{x \in \mathbb{R}^d}{\cal E}(x) < \infty$ or for some $C_3,C_4 > 0$. Furthermore, in the case of an active gradient drift in the CBO dynamics \eqref{eq:CBO_macro_with_memory},

Figures (6)

Figure 1: A visualization of the CBO dynamics \ref{eq:CBO_micro_with_memory} with memory effects and gradient information. Particles with positions $X^1,\dots,X^N$ (yellow dots with their trajectories) explore the energy landscape of the objective ${\cal E}$ in search of the global minimizer $x^*$ (green star). Each particle stores its local historical best position $Y^i_t$ (yellow circles). The dynamics of the position $X^i_t$ of each particle is governed by three deterministic terms with associated random noise terms (visualized by depicting eight possible realizations with differently shaded green arrows). A global drift term (dark blue arrow) drags the particle towards the consensus point $y_\alpha(\widehat{\rho}_{Y,t}^N)$ (orange circle), which is computed as a weighted (visualized through color opacity) average of the particles' historical best positions. A local drift term (light blue arrow) imposes movement towards the respective local best position $Y^i_t$. A gradient drift term (purple arrow) exerts a force in the direction $-\nabla{\cal E}(X^i_t)$.
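The consensus point described in this caption can be illustrated with a minimal NumPy sketch. The caption only states that it is a weighted average of the historical best positions; the Gibbs-type weights $\exp(-\alpha{\cal E}(Y^i_t))$ used below are an assumption taken from the standard CBO literature, and the function name `consensus_point` is hypothetical.

```python
import numpy as np

def consensus_point(Y, energies, alpha):
    """Weighted average of the historical best positions Y (shape N x d).

    Assumption: Gibbs-type weights exp(-alpha * E(Y^i)), as is common in CBO.
    """
    # Shift by the minimal energy so that the largest weight equals exp(0) = 1;
    # this leaves the normalized average unchanged but avoids underflow for large alpha.
    weights = np.exp(-alpha * (energies - energies.min()))
    return (weights[:, None] * Y).sum(axis=0) / weights.sum()

# Three hypothetical particles in dimension d = 2 with their objective values.
Y = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
E = np.array([0.0, 1.0, 2.0])
print(consensus_point(Y, E, alpha=100.0))
```

For large $\alpha$ the average concentrates on the particle with the lowest objective value, whereas $\alpha=0$ reduces it to the plain arithmetic mean of the positions.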
Figure 2: A demonstration of the benefits of memory effects and gradient information in CBO methods. In both settings (a) and (b) the depicted success probabilities are averaged over $100$ runs of CBO and the implemented scheme is given by an Euler-Maruyama discretization of Equation \ref{eq:CBO_micro_with_memory} with time horizon $T=20$, discrete time step size $\Delta t=0.01$, $\alpha=100$, $\beta=\infty$, $\theta=0$, $\kappa=1/\Delta t$, $\lambda_1=1$ and $\sigma_1=\sqrt{1.6}$. In (a) we plot the success probability of CBO without (left separate column) and with (right phase diagram) memory effects for different values of the parameter $\lambda_2$, i.e., for different strengths of the memory drift, when optimizing the Rastrigin function ${\cal E}(x) = \sum_{k=1}^d x_k^2 + \frac{5}{2} (1-\cos(2\pi x_k))$ in dimension $d=4$. As remaining parameters we choose $\sigma_2=\lambda_1\sigma_1$ and $\lambda_3=\sigma_3=0$, i.e., no gradient information is involved. We observe that an increasing amount of memory drift improves the success probability significantly, even in the case where, theoretically, there are no convergence guarantees anymore, see Theorem \ref{thm:global_convergence_main} and Corollary \ref{cor:global_convergence_main}. Section \ref{sec:numerics:Rastrigin} provides further details. In (b) we depict the success probability of CBO without (left separate column) and with (right phase diagram) gradient information for different values of the parameter $\lambda_3$, i.e., for different strengths of the gradient drift, when solving a compressed sensing problem in dimension $d=200$ with sparsity $s=8$. On the vertical axis we depict the number of measurements $m$, from which we try to recover the sparse signal by solving the associated $\ell_1$-regularized problem (LASSO). As remaining parameters we use merely $N=10$ particles, choose $\sigma_3=0$ and $\lambda_2=\sigma_2=0$, i.e., no memory drift is involved. We observe that gradient information is required to be able to identify the correct sparse solution and standard CBO would fail in such a task. Section \ref{sec:numerics:CS} provides more details.
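The setting of panel (a) can be sketched in a few lines of NumPy. The Rastrigin function and the parameters $\Delta t$, $\alpha$, $\lambda_1$, $\sigma_1$, $\sigma_2=\lambda_1\sigma_1$ follow the caption. Everything else is an assumption made for illustration, not the author's implementation: the exact form of the update (componentwise noise, Gibbs-type consensus weights), the reading of $\beta=\infty$, $\theta=0$, $\kappa=1/\Delta t$ as "replace the stored best position whenever the new position is strictly better", the value `lam2=0.4`, the particle number and the initialization.

```python
import numpy as np

def rastrigin(x):
    """E(x) = sum_k x_k^2 + (5/2) * (1 - cos(2*pi*x_k)), summed over the last axis."""
    return np.sum(x**2 + 2.5 * (1.0 - np.cos(2.0 * np.pi * x)), axis=-1)

def cbo_memory_step(X, Y, rng, dt=0.01, alpha=100.0,
                    lam1=1.0, sigma1=np.sqrt(1.6), lam2=0.4, sigma2=None):
    """One Euler-Maruyama step of a CBO scheme with memory drift (no gradient term).

    X, Y: current and historical best positions, both of shape (N, d).
    """
    if sigma2 is None:
        sigma2 = lam1 * sigma1  # choice stated in the caption
    # Consensus point from the historical best positions (assumed Gibbs weights).
    EY = rastrigin(Y)
    w = np.exp(-alpha * (EY - EY.min()))
    y_alpha = (w[:, None] * Y).sum(axis=0) / w.sum()
    # Independent Brownian increments for the global and the memory noise term.
    dB1 = np.sqrt(dt) * rng.standard_normal(X.shape)
    dB2 = np.sqrt(dt) * rng.standard_normal(X.shape)
    X_new = (X
             - lam1 * (X - y_alpha) * dt       # global drift towards the consensus point
             - lam2 * (X - Y) * dt             # memory drift towards the own best position
             + sigma1 * (X - y_alpha) * dB1    # componentwise (anisotropic) noise, assumed
             + sigma2 * (X - Y) * dB2)
    # Memory update: keep the better of the stored best and the new position.
    improved = rastrigin(X_new) < EY
    Y_new = np.where(improved[:, None], X_new, Y)
    return X_new, Y_new

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 4))  # hypothetical: N = 50 particles, d = 4
Y = X.copy()
for _ in range(2000):                     # T / dt = 20 / 0.01 steps
    X, Y = cbo_memory_step(X, Y, rng)
print(rastrigin(Y).min())
```

By construction the objective value at each stored best position never increases from one step to the next, which is the memory effect the caption refers to.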
Figure 3: Success probability of CBO without (left separate column) and with memory effects for different values of the parameter $\lambda_2\in[0,4]$ (right phase diagram) when optimizing the Rastrigin function in dimension $d=4$ in the setting of Figure \ref{fig:benefits_memory} with the exception of setting $\sigma_2=0$. In this way we validate that the presence of memory effects is responsible for the improved performance and not just a higher noise level.
Figure 4: NN architectures used in the experiments of Section \ref{sec:numerics:NN}. Images are represented as $28\times28$ matrices with entries in $[0,1]$. For the shallow NN in (a) the input is reshaped into a vector $x\in\mathbb{R}^{784}$ which is then passed through a dense layer of the form $\mathrm{ReLU}(Wx+b)$ with trainable weights $W\in\mathbb{R}^{10\times784}$ and bias $b\in\mathbb{R}^{10}$. The learnable parameters of the CNN in (b) are the kernels and the final dense layer. Both networks include a batch normalization step after each $\mathrm{ReLU}$ activation function and a softmax activation in the last layer in order to be able to interpret the output as a probability distribution over the digits. We denote the trainable parameters of the NN by $\theta$. The shallow NN has $7850$ trainable parameters and the CNN $2112$. (Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, Applications of Evolutionary Computation, Convergence of Anisotropic Consensus-Based Optimization in Mean-Field Law, M. Fornasier, T. Klock, K. Riedl, © 2022.)
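As a quick arithmetic check of the stated parameter count of the shallow NN, the following sketch counts the weights and biases of a single dense layer. It assumes that the flattened $28\times28$ image has $28\cdot28=784$ entries and that the batch normalization step contributes no trainable parameters; the helper name `dense_layer_params` is hypothetical.

```python
def dense_layer_params(n_in, n_out):
    """Number of entries of a weight matrix W (n_out x n_in) plus a bias b (n_out)."""
    return n_out * n_in + n_out

n_in = 28 * 28  # flattened 28 x 28 image
print(n_in, dense_layer_params(n_in, 10))  # prints: 784 7850
```

Under these assumptions the count agrees with the $7850$ trainable parameters reported for the shallow NN.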
Figure 5: Comparison of the performances (testing accuracy and training loss) of a shallow NN (dashed lines) and a CNN (solid lines) with architectures as described in Figure \ref{fig:architectures}, when trained with CBO without memory effects (lightest lines), with memory effects but without memory drift (lines with intermediate opacity) and with memory effects and memory drift (darkest lines). Depicted are the accuracies on a test dataset (orange lines) and the values of the objective function ${\cal E}$ evaluated on a random sample of the training set of size $10000$ (blue lines). We observe that memory effects slightly improve the final accuracies while slowing down the training process initially.
...and 1 more figure

Theorems & Definitions (22)

Definition 1
Theorem 2
Remark 3
Proof: Proof sketch of Theorem \ref{thm:well-posedness_FP}
Definition 4: Assumptions
Theorem 5
Corollary 6
Remark 7
Lemma 8
proof
...and 12 more
