Variational Inference with Mixtures of Isotropic Gaussians

Marguerite Petit-Talamon, Marc Lambert, Anna Korba

TL;DR

This work introduces a variational-inference framework based on mixtures of isotropic Gaussians (MIG) with uniform weights to approximate multimodal posteriors efficiently. It develops two geometric optimization schemes on the isotropic-Gaussian manifold, Bures-Wasserstein gradient descent and entropic mirror descent, then extends them to the multi-component MIG family with joint mean and variance updates. The paper provides explicit gradient formulas, a JKO-inspired scheme for mixtures, and two practical update rules (IBW and MD) that preserve variance positivity and enable scalable inference. Empirical results on synthetic mixtures and Bayesian posterior tasks show that increasing the number of components improves multimodal coverage with modest computational overhead, outperforming several baselines in accuracy and efficiency. The approach balances expressivity and tractability for variational inference in multimodal settings, and the paper points to directions for theoretical guarantees and broader applicability.
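To make the variational family concrete, here is a minimal NumPy sketch of a uniform-weight mixture of isotropic Gaussians, together with a Monte Carlo estimate of the reverse KL divergence (the usual VI loss) up to the target's normalizing constant. The function names, the sample size, and the standard-Gaussian target in the usage lines are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mig_log_density(x, means, variances):
    """Log-density of a uniform-weight mixture of isotropic Gaussians.

    x: (n, d) points; means: (N, d); variances: (N,) positive scalars.
    """
    d = x.shape[1]
    # Squared distance from every point to every component mean: (n, N)
    sq = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_comp = -0.5 * sq / variances - 0.5 * d * np.log(2 * np.pi * variances)
    # Stable log of (1/N) * sum_i N(x; m_i, v_i I)
    top = log_comp.max(axis=1)
    lse = top + np.log(np.exp(log_comp - top[:, None]).sum(axis=1))
    return lse - np.log(len(means))

def mig_sample(rng, means, variances, n):
    """Draw n samples: pick a component uniformly, then add isotropic noise."""
    idx = rng.integers(len(means), size=n)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[idx] + np.sqrt(variances[idx])[:, None] * noise

def reverse_kl_estimate(rng, means, variances, log_target, n=5000):
    """Monte Carlo estimate of KL(q || pi), up to the log-normalizer of pi."""
    x = mig_sample(rng, means, variances, n)
    return np.mean(mig_log_density(x, means, variances) - log_target(x))

# Usage with a hypothetical target: an unnormalized standard Gaussian.
rng = np.random.default_rng(0)
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
variances = np.array([0.5, 1.5])
log_target = lambda x: -0.5 * (x ** 2).sum(axis=1)
print(reverse_kl_estimate(rng, means, variances, log_target))
```

With N components in dimension d, this family stores N(d + 1) numbers (one mean vector and one scalar variance per component), whereas full covariance matrices would need d(d + 1)/2 entries per component in place of that scalar.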

Abstract

Variational inference (VI) is a popular approach in Bayesian inference that looks for the best approximation of the posterior distribution within a parametric family by minimizing a loss, typically the (reverse) Kullback-Leibler (KL) divergence. In this paper, we focus on the following parametric family: mixtures of isotropic Gaussians (i.e., with diagonal covariance matrices proportional to the identity) and uniform weights. We develop a variational framework and provide efficient algorithms suited for this family. In contrast with mixtures of Gaussians with generic covariance matrices, this choice strikes a balance between accurately approximating multimodal Bayesian posteriors and remaining memory- and computationally efficient. Our algorithms implement gradient descent on the locations of the mixture components (the modes of the Gaussians), and either (entropic) mirror descent or Bures descent on their variance parameters. We illustrate the performance of our algorithms in numerical experiments.
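As a rough illustration of the two kinds of update the abstract describes, the sketch below fits a single isotropic Gaussian N(m, v I) to a target proportional to exp(-V), using a gradient step on the mean and a multiplicative step on the variance. It is not the paper's algorithm: the multiplicative step is the generic entropic mirror-descent (exponentiated-gradient) update on a positive scalar, the mixture case is omitted, and the name `fit_isotropic_gaussian`, the step size, and the sample count are assumptions made for this example.

```python
import numpy as np

def fit_isotropic_gaussian(grad_V, d, steps=500, lr=0.1, n_samples=512, seed=0):
    """Minimize KL(N(m, v I) || pi) for pi proportional to exp(-V).

    grad_V maps an (n, d) array of points to the (n, d) array of gradients of V.
    """
    rng = np.random.default_rng(seed)
    m, v = np.zeros(d), 1.0
    for _ in range(steps):
        z = rng.standard_normal((n_samples, d))
        x = m + np.sqrt(v) * z              # reparameterized samples
        g = grad_V(x)
        # Gradient of E[V] with respect to the mean
        grad_m = g.mean(axis=0)
        # d/dv E[V(m + sqrt(v) z)] = E[(x - m) . grad V(x)] / (2 v);
        # the negative entropy -(d/2) log(2 pi e v) contributes -d / (2 v).
        grad_v = ((x - m) * g).sum(axis=1).mean() / (2 * v) - d / (2 * v)
        m = m - lr * grad_m                 # plain gradient step on the location
        v = v * np.exp(-lr * grad_v)        # multiplicative step keeps v > 0
    return m, v

# Usage with a hypothetical isotropic Gaussian target N(mu, s I),
# for which V(x) = |x - mu|^2 / (2 s) and the optimum is m = mu, v = s.
mu, s = np.array([1.0, -2.0]), 0.25
m_hat, v_hat = fit_isotropic_gaussian(lambda x: (x - mu) / s, d=2)
print(m_hat, v_hat)
```

Because the variance is updated by multiplying with an exponential, it stays positive for any step size, whereas an additive step would need a projection or a step-size restriction.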

Paper Structure

This paper contains 49 sections, 8 theorems, 145 equations, 13 figures, 1 table, 1 algorithm.

Key Result

Proposition 2.1

Suppose that $\nabla^2 V \succeq \alpha \mathrm{I}_d$ for some $\alpha \in \mathbb{R}$. Then, for any $p_0 \in \mathrm{IG}$, there is a unique solution $(p_t)_{t}$ to the flow obtained as a limit of Eq (eq:JKO) as $\gamma\to0$. Then, for all $t \geq 0$ and $p^*\in \mathrm{IG}$, implying that the flow converges linearly when $\alpha>0$.

Figures (13)

  • Figure 1: Illustration of convergence of Algorithm 1 (mirror descent) for a two-dimensional target distribution.
  • Figure 2: (First 10) Marginals (left) and KL objective (right) for MD, IBW, BW and NGD $(d=20)$.
  • Figure 3: Time/KL evolution w.r.t. $d$.
  • Figure 4: Bayesian logistic regression and BNN regression approximated by mixtures of Gaussians.
  • Figure 5: Optimal transport plan between two mixtures of two Gaussians with equal weights $\frac{1}{2}$. Left: the Wasserstein distance $\mathrm{W}_2$ case, where the optimal transport plan is unconstrained. Right: the $\mathrm{MW}_2$ case, where the optimal transport plan is constrained to be a mixture of Gaussians. In the $\mathrm{W}_2$ case there exists a bijective map. No such bijective map exists in the $\mathrm{MW}_2$ case: in the right panel, some points have two images. These figures are generated using the Python Optimal Transport library (https://pythonot.github.io/).
  • ...and 8 more figures
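The $\mathrm{MW}_2$ distance mentioned in the Figure 5 caption can be sketched for the special case of isotropic components with uniform weights. The example assumes the standard mixture-Wasserstein construction, in which transport is restricted to couplings between components and the cost between two components is their squared $\mathrm{W}_2$ distance; for isotropic Gaussians that cost is $\|m_1 - m_2\|^2 + d(\sqrt{v_1} - \sqrt{v_2})^2$. With equal numbers of components and uniform weights, an optimal coupling can be taken to be a permutation, which the sketch finds by brute force. The function names are hypothetical, and this is not the code used to produce the figure.

```python
from itertools import permutations
import numpy as np

def w2_sq_isotropic(m1, v1, m2, v2):
    """Squared W2 distance between N(m1, v1 I) and N(m2, v2 I)."""
    d = len(m1)
    return float(np.sum((m1 - m2) ** 2) + d * (np.sqrt(v1) - np.sqrt(v2)) ** 2)

def mw2_sq_uniform(means_a, vars_a, means_b, vars_b):
    """Squared MW2 between two uniform mixtures with the same number of components.

    Brute force over permutations, so only suitable for a handful of components.
    """
    n = len(means_a)
    cost = np.array([[w2_sq_isotropic(means_a[i], vars_a[i], means_b[j], vars_b[j])
                      for j in range(n)] for i in range(n)])
    total = lambda p: sum(cost[i, p[i]] for i in range(n))
    best = min(permutations(range(n)), key=total)
    return total(best) / n, best

# Two mixtures of two isotropic Gaussians with weights 1/2 (illustrative values).
means_a = np.array([[-1.0, 0.0], [1.0, 0.0]])
vars_a = np.array([1.0, 1.0])
means_b = np.array([[1.0, 0.5], [-1.0, 0.5]])
vars_b = np.array([1.0, 4.0])
value, matching = mw2_sq_uniform(means_a, vars_a, means_b, vars_b)
print(value, matching)  # 1.25 (1, 0)
```

For more than a few components, the brute-force search would be replaced by an assignment solver such as SciPy's `linear_sum_assignment`.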

Theorems & Definitions (19)

  • Proposition 2.1
  • Definition 2.2
  • Proposition 3.1
  • Remark 4.1
  • Proposition A.1
  • proof
  • Definition A.2
  • Remark A.3
  • Lemma A.4
  • proof
  • ...and 9 more