Natural Variational Annealing for Multimodal Optimization

Tâm LeMinh; Julyan Arbel; Thomas Möllenhoff; Mohammad Emtiyaz Khan; Florence Forbes

TL;DR

Natural Variational Annealing (NVA) is a multimodal optimization framework that combines variational approximations (mixture search distributions), entropy-regularized annealing, and natural-gradient learning. The mixture components explore multiple basins simultaneously, and a tempered objective progressively concentrates the search on high-value regions. Variants are given for Gaussian mixtures (NVA-GM) and for fitness shaping (FS-NVA-GM). Theoretical results show that annealing concentrates the search on global modes and that Gaussian mixtures track modes as $\omega \to 0$. Simulations and an inverse problem in planetary science illustrate mode-finding in practice. The framework exposes tunable trade-offs between exploration, convergence, and computational cost, and points to scalability and adaptive scheduling as directions for further work.
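The concentration effect of annealing can be seen on a toy problem. The sketch below is not the NVA algorithm: it only evaluates a discretized tempered target on a 1D grid, assuming the target is proportional to $\exp(\ell(x)/\omega)$. The objective `ell`, the grid, and the helper names are hypothetical choices made for illustration.

```python
import math

def ell(x):
    # Hypothetical toy objective: global mode near x = -2 (height 1.0),
    # local mode near x = +2 (height 0.8).
    return 1.0 * math.exp(-0.5 * (x + 2.0) ** 2) + 0.8 * math.exp(-0.5 * (x - 2.0) ** 2)

def gibbs_on_grid(objective, xs, omega):
    """Discretized measure with weights proportional to exp(objective(x) / omega)."""
    logits = [objective(x) / omega for x in xs]
    top = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def mass_near(xs, probs, center, radius):
    """Probability mass assigned to grid points within `radius` of `center`."""
    return sum(p for x, p in zip(xs, probs) if abs(x - center) < radius)

# Uniform grid on [-6, 6].
xs = [-6.0 + 12.0 * i / 2000 for i in range(2001)]

for omega in (1.0, 0.1, 0.01):
    probs = gibbs_on_grid(ell, xs, omega)
    print(f"omega={omega}: mass near global mode = {mass_near(xs, probs, -2.0, 0.5):.3f}, "
          f"near local mode = {mass_near(xs, probs, 2.0, 0.5):.3f}")
```

As `omega` decreases, the mass near the global mode grows while the mass near the local mode shrinks, which is the behavior a schedule over $\omega$ exploits.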

Abstract

We introduce a new multimodal optimization approach called Natural Variational Annealing (NVA) that combines three foundational concepts to search simultaneously for multiple global and local modes of black-box nonconvex objectives. First, it implements a simultaneous search by using variational posteriors such as mixtures of Gaussians. Second, it applies annealing to gradually trade off exploration for exploitation. Finally, it learns the variational search distribution by natural-gradient learning, whose updates resemble well-known and easy-to-implement algorithms. Together, the three concepts give rise to new algorithms and also allow us to incorporate "fitness shaping", a core concept from evolutionary algorithms. We assess the quality of the search in simulations and compare it with methods based on gradient descent and evolution strategies. We also provide an application to a real-world inverse problem in planetary science.
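Fitness shaping replaces raw objective values with rank-based utilities before they enter an update, so the search becomes invariant to monotone transformations of the objective. A minimal sketch of one common choice, the rank-based utilities used in natural evolution strategies, follows. Whether FS-NVA-GM uses exactly this formula is an assumption, and `rank_utilities` is a hypothetical helper name.

```python
import math

def rank_utilities(fitnesses):
    """Map raw fitness values to rank-based utilities that sum to zero.

    The sample with the largest fitness gets rank 1 and the largest utility.
    The lower half of the ranks receives the same (negative) baseline value.
    """
    n = len(fitnesses)
    # Indices sorted from best (largest fitness) to worst.
    order = sorted(range(n), key=lambda i: fitnesses[i], reverse=True)
    # Unnormalized weights: max(0, log(n/2 + 1) - log(rank)).
    raw = [max(0.0, math.log(n / 2 + 1) - math.log(rank)) for rank in range(1, n + 1)]
    total = sum(raw)
    utilities = [0.0] * n
    for position, index in enumerate(order):
        # Normalize to sum to one, then shift so the utilities sum to zero.
        utilities[index] = raw[position] / total - 1.0 / n
    return utilities

samples_fitness = [0.1, 5.0, -3.0, 2.0]
print(rank_utilities(samples_fitness))
```

Because only the ordering of the fitness values matters, applying any strictly increasing function to `samples_fitness` leaves the utilities unchanged.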

Paper Structure (108 sections, 14 theorems, 125 equations, 8 figures, 6 tables, 4 algorithms)

Abstract.
Keywords.
Introduction
Contributions.
Outline.
Natural variational annealing with mixtures
Natural parameterization of mixtures
Natural gradient update rules
Estimation of the natural gradients
Stochastic natural gradient ascent
The NVA-M algorithm
Special case of Gaussian mixtures (NVA-GM)
Annealing properties of Gaussian mixtures variational approximation
Single Gaussian behavior.
Mixture behavior.
...and 93 more sections

Key Result

Proposition 2.1

The solution of the problem in (eq:problem_fixed_omega) is given by the Gibbs measure $g_\omega$ defined in (eq:gibbs_measure).
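The referenced equations are not reproduced on this page. For orientation, the classical Gibbs variational principle has the following form, assuming the fixed-$\omega$ problem is an entropy-regularized maximization of the expected objective. The symbols $\ell$ (objective), $q$ (search distribution), and $\mathcal{H}$ (entropy) are assumptions here and may differ from the paper's notation.

```latex
g_\omega
  \;=\; \arg\max_{q} \Big\{ \mathbb{E}_{q}\big[\ell(x)\big] + \omega\, \mathcal{H}(q) \Big\},
\qquad
g_\omega(x) \;=\; \frac{\exp\!\big(\ell(x)/\omega\big)}{\int \exp\!\big(\ell(x')/\omega\big)\, \mathrm{d}x'} .
```

Under this form, a smaller $\omega$ sharpens $g_\omega$ around the maximizers of $\ell$.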

Figures (8)

Figure 1: Symmetric mixture: The first graph shows that FS-NVA-GM with CMA-ES utility values almost always locates distinct global modes of the symmetric mixture log-density. For $K \ge 3$, it almost always finds all three of them. The second graph indicates that the identified modes are distinct as long as there are at most as many components as (global and local) modes. NVA-GM performs similarly, except when $K \le 3$, where it sometimes misses a global mode, returning the local mode instead. Parallel CMA-ES and parallel SGA struggle more to find distinct modes. The dashed line represents the maximum performance achievable for given values of $K$.
Figure 2: Styblinski–Tang: The first graph shows that both NVA-GM and FS-NVA-GM with CMA-ES utility values almost always locate the unique global mode of the 4-dimensional Styblinski–Tang function for all values of $K \ge 2$. The second graph indicates that the identified modes are distinct most of the time, except when $K$ is close to the total number of modes (16). In this respect, NVA-GM performs slightly better than FS-NVA-GM because the latter has a greater tendency to converge to the global mode rather than to the 15 other local modes. Parallel CMA-ES and parallel SGA struggle more to find the global mode, as well as diverse local modes, although parallel CMA-ES matches the performance of NVA-GM and FS-NVA-GM in finding the global mode when $K \ge 8$. The dashed line represents the maximum performance achievable for given values of $K$.
Figure 3: Left: The trajectories of the means in a run of NVA-GM on the log-density of the symmetric mixture with $K = 4$ show that the four components track different modes. The contour plot represents non-equally spaced levels of $\ell$. Right: As expected, the weights of components tracking global modes (1, 3, 4) converge to $1/3$, whereas the weight of the component tracking the local mode (2) vanishes.
Figure 4: Left: The trajectories of the means in a run of NVA-GM on the log-density of the asymmetric mixture with $K = 3$ show that the three components track different modes. The contour plot represents non-equally spaced levels of $\ell$. Right: As expected, the weights of components tracking global modes (1, 3) converge to their respective limits $\tilde{c}_1 \approx 0.586$ and $\tilde{c}_3 \approx 0.414$, whereas the weight of the component tracking the local mode (2) vanishes.
Figure 5: Remote sensing illustration: BRDF signal reconstructions. The Nontronite observed signal $\mathbf{y}_{\text{o}}$ (black) is compared to the reconstructed signals from the 4 modes found by NVA-GM, $\boldsymbol{\Psi}_1$ (red), $\boldsymbol{\Psi}_2$ (blue), $\boldsymbol{\Psi}_3$ (green) and $\boldsymbol{\Psi}_4$ (purple). The black dashed lines show a band of $\pm 0.05$ around $\mathbf{y}_{\text{o}}$. The red dashed line shows for comparison the signal in the training set with the highest correlation to $\mathbf{y}_{\text{o}}$.
...and 3 more figures

Theorems & Definitions (25)

Proposition 2.1: kullback1959information, donsker1976asymptotic
Theorem 2.2: Annealed Gibbs measure
Proposition 4.1
Proposition A.1
proof
proof: Proof of Proposition \ref{prop:solution_fixed_omega}
Theorem C.1: Laplace's theorem
Lemma C.3
proof
Lemma C.4
...and 15 more
