Efficient Mixture Learning in Black-Box Variational Inference

Alexandra Hotti, Oskar Kviman, Ricky Molén, Víctor Elvira, Jens Lagergren

TL;DR

This work tackles the scalability and efficiency bottlenecks of mixture distributions in black-box variational inference by introducing MISVAE, a two-network architecture that amortizes mixture parameterization with shared weights via one-hot encodings. It pairs MISVAE with two novel MIS-based estimators, Some-to-All ($\mathrm{S2A}$) and Some-to-Some ($\mathrm{S2S}$), to dramatically reduce inference time while preserving or improving variational performance, enabling hundreds of mixture components. The approach achieves state-of-the-art marginal log-likelihoods on MNIST and FashionMNIST with far fewer parameters than existing mixtures and also reduces inference time in Bayesian phylogenetics (VBPI) across multiple datasets. Together, MISVAE and the estimators substantially expand the practical feasibility of large-mixture BBVI in both vision and structured-domain applications, offering a scalable path for rich posterior approximations.

Abstract

Mixture variational distributions in black box variational inference (BBVI) have demonstrated impressive results in challenging density estimation tasks. However, currently scaling the number of mixture components can lead to a linear increase in the number of learnable parameters and a quadratic increase in inference time due to the evaluation of the evidence lower bound (ELBO). Our two key contributions address these limitations. First, we introduce the novel Multiple Importance Sampling Variational Autoencoder (MISVAE), which amortizes the mapping from input to mixture-parameter space using one-hot encodings. Fortunately, with MISVAE, each additional mixture component incurs a negligible increase in network parameters. Second, we construct two new estimators of the ELBO for mixtures in BBVI, enabling a tremendous reduction in inference time with marginal or even improved impact on performance. Collectively, our contributions enable scalability to hundreds of mixture components and provide superior estimation performance in shorter time, with fewer network parameters compared to previous Mixture VAEs. Experimenting with MISVAE, we achieve astonishing, SOTA results on MNIST. Furthermore, we empirically validate our estimators in other BBVI settings, including Bayesian phylogenetic inference, where we improve inference times for the SOTA mixture model on eight data sets.
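
The cost argument in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a uniformly weighted 1-D Gaussian mixture, a toy standard-normal target in place of $\log p(x, z)$, and one sample per component. The helper `mixture_elbo_estimate` and its arguments are hypothetical names. All-to-All ($\mathrm{A2A}$) samples from, and evaluates against, all $A$ components; $\mathrm{S2A}$ samples from $S$ randomly chosen components but evaluates against all $A$; $\mathrm{S2S}$ both samples from and evaluates against the same $S$ components.

```python
import numpy as np

def log_normal(z, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at z (broadcasts)."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def mixture_elbo_estimate(log_joint, mu, sigma, sample_idx, denom_idx, rng):
    """Draw one z from each component in sample_idx and score it against the
    uniform mixture over the components in denom_idx.
    Returns (estimate, number of component-density evaluations)."""
    z = mu[sample_idx] + sigma[sample_idx] * rng.standard_normal(len(sample_idx))
    # log q_j(z_s) for every sampled z_s and every denominator component j
    log_q = log_normal(z[:, None], mu[denom_idx][None, :], sigma[denom_idx][None, :])
    log_mix = logsumexp(log_q, axis=1) - np.log(len(denom_idx))
    return float(np.mean(log_joint(z) - log_mix)), log_q.size

rng = np.random.default_rng(0)
A, S = 100, 5
mu = np.linspace(-2.0, 2.0, A)                  # toy component means (assumption)
sigma = np.full(A, 0.5)                         # toy component scales (assumption)
log_joint = lambda z: log_normal(z, 0.0, 1.0)   # toy stand-in for log p(x, z)

all_idx = np.arange(A)
some_idx = rng.choice(A, size=S, replace=False)

a2a, cost_a2a = mixture_elbo_estimate(log_joint, mu, sigma, all_idx, all_idx, rng)
s2a, cost_s2a = mixture_elbo_estimate(log_joint, mu, sigma, some_idx, all_idx, rng)
s2s, cost_s2s = mixture_elbo_estimate(log_joint, mu, sigma, some_idx, some_idx, rng)
print(cost_a2a, cost_s2a, cost_s2s)  # 10000 500 25
```

The printed counts are the number of component-density evaluations per estimate: $A^2$, $SA$, and $S^2$ respectively, which is where the quadratic cost and its reduction come from in this toy setting.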

Paper Structure

This paper contains 32 sections, 8 theorems, 34 equations, 14 figures, and 4 tables.

Key Result

Theorem 4.1

The Some-to-All ($\mathrm{S2A}$) estimator is an unbiased estimator of the MISELBO objective.
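
A short sketch of why such a result can hold is given below. It is not the paper's proof: it assumes the MISELBO is the uniformly weighted mixture ELBO over $A$ components, and that the $S$ sampled components form a subset $\mathcal{S}$ drawn uniformly without replacement. The symbols $f$ and $\mathcal{S}$ are notation introduced here for illustration.

```latex
% Assumed form of the objective, with f(z) the mixture log-ratio:
\mathcal{L} = \frac{1}{A}\sum_{a=1}^{A} \mathbb{E}_{q_{\phi_a}(z)}\big[f(z)\big],
\qquad
f(z) = \log \frac{p_\theta(x, z)}{\frac{1}{A}\sum_{a'=1}^{A} q_{\phi_{a'}}(z)} .

% Assumed S2A estimator: sample from S components, evaluate against all A:
\widetilde{\mathcal{L}}_{\mathrm{S2A}} = \frac{1}{S}\sum_{s \in \mathcal{S}} f(z_s),
\qquad z_s \sim q_{\phi_s}(z) .

% Each index a belongs to \mathcal{S} with probability S/A, hence
\mathbb{E}\big[\widetilde{\mathcal{L}}_{\mathrm{S2A}}\big]
= \frac{1}{S}\sum_{a=1}^{A} \Pr(a \in \mathcal{S})\, \mathbb{E}_{q_{\phi_a}(z)}\big[f(z)\big]
= \frac{1}{S}\sum_{a=1}^{A} \frac{S}{A}\, \mathbb{E}_{q_{\phi_a}(z)}\big[f(z)\big]
= \mathcal{L} .
```

The key point under these assumptions is that the denominator of $f$ still contains all $A$ components, so subsampling only affects which expectations are averaged, not the integrand itself.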

Figures (14)

  • Figure 1: SOTA Performance with Small and Efficient Networks: NLL values for MISVAE trained with the $\mathrm{S2A}$ estimator with $S=1$ and a gradually increasing $A$.
  • Figure 2: Block diagram depicting the estimation of MISELBO using MISVAE with the $\mathrm{S2S}$ estimator, with $S=2$ and $A=3$. First, $f_{D2\mathcal{H}}$ maps the data to an intermediate hidden space, producing a representation $h$. The next network, $f_\phi$, takes as input $h$ together with $S$ one-hot encodings of dimension $A$, which indicate the $S$ mixture components used by the $\mathrm{S2S}$ estimator, and maps them to the variational parameters of those components, here $\phi_1$ and $\phi_2$. Samples drawn from the $S$ components are then passed to a decoding network to produce the parameters $\theta$ of the generative model. Collectively, the sampled latent variables, the variational parameters, and $\theta$ are used to compute $\widetilde{\mathcal{L}}_{\mathrm{S2S}}$. The diagram is explained in detail in Sec. \ref{sec:misvae}. Corresponding diagrams for the $\mathrm{S2A}$ and $\mathrm{A2A}$ estimators can be found in Fig. \ref{fig:misvae-architecture-s2a-a2a}.
  • Figure 3: Comparison of MISELBO approximation performance and training runtimes across three distinct estimators under various settings of $S$ and $A$ in the Toy Experiment, trained for $50{,}000$ epochs.
  • Figure 4: Results on MNIST for MISVAE trained with various combinations of $S$ and $A$, with the $\mathrm{S2S}$ estimator (top row) and the $\mathrm{S2A}$ estimator (bottom row). (a) Average (solid) NLL results computed over three runs with one standard deviation (opaque) displayed, (b) training time per epoch, and (c) the number of network parameters for MISVAE for increasing values of $A$. Using $\text{MISVAE}$, the number of network parameters increases by a small amount as we increase $A$. Also, with the $\text{S2S}$ estimator, we can keep $S$ fixed and increase $A$, without impacting the number of seconds needed to complete an epoch and simultaneously improving the NLL. For $\text{S2A}$, we converge to an equivalent solution with $A$ held fixed for any $S < A$, meaning that in practice, we can scale up $A$ for small values of $S$ at a small extra computational cost per mixture component.
  • Figure 5: Comparison between SEMVAE and MISVAE using the S2S, A2A, and S2A estimators on MNIST: (a) NLL scores for increasing values of $A$, (b) training time per epoch, and (c) number of hyperparameters for increasing $A$ for SEMVAE compared to MISVAE. Note: The green curve represents the performance of MISVAE using the S2S estimator with $S$ increasing, such that $S=A$ on the x-axis, while $A$ is held fixed at $50$.
  • ...and 9 more figures
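
The encoder path described in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' architecture: the layer sizes, the single hidden layer with tanh activations, the diagonal-Gaussian output split into a mean and a log standard deviation, and all function names (`init_encoder`, `encode`, `n_params`) are hypothetical. The sketch shows how a shared network conditioned on one-hot component indicators yields per-component variational parameters, and why adding a component adds few weights.

```python
import numpy as np

def init_linear(rng, n_in, n_out):
    """Weights and bias of one dense layer."""
    return {"W": rng.standard_normal((n_in, n_out)) * 0.01, "b": np.zeros(n_out)}

def linear(p, x):
    return x @ p["W"] + p["b"]

def init_encoder(rng, x_dim, h_dim, z_dim, A):
    """Shared encoder: data -> h, then (h, one-hot) -> variational parameters."""
    return {
        "d2h": init_linear(rng, x_dim, h_dim),           # plays the role of f_{D2H}
        "phi_hidden": init_linear(rng, h_dim + A, h_dim),  # first layer of f_phi
        "phi_out": init_linear(rng, h_dim, 2 * z_dim),     # outputs mean and log-scale
    }

def encode(params, x, component_idx, A):
    """Return (mu, log_sigma), one row per requested mixture component."""
    h = np.tanh(linear(params["d2h"], x))                 # shape (h_dim,)
    one_hot = np.eye(A)[component_idx]                    # shape (S, A)
    h_rep = np.broadcast_to(h, (len(component_idx), h.shape[0]))
    inp = np.concatenate([h_rep, one_hot], axis=1)        # shape (S, h_dim + A)
    hid = np.tanh(linear(params["phi_hidden"], inp))
    mu, log_sigma = np.split(linear(params["phi_out"], hid), 2, axis=1)
    return mu, log_sigma

def n_params(params):
    return sum(v.size for layer in params.values() for v in layer.values())

rng = np.random.default_rng(0)
x_dim, h_dim, z_dim, A = 784, 32, 8, 3
params = init_encoder(rng, x_dim, h_dim, z_dim, A)
x = rng.standard_normal(x_dim)

# S = 2 of the A = 3 components, as in the figure's setting
mu, log_sigma = encode(params, x, [0, 2], A)

# Growing the mixture by one component only widens the one-hot input
extra = n_params(init_encoder(rng, x_dim, h_dim, z_dim, A + 1)) - n_params(params)
```

In this sketch, `extra` equals `h_dim`: one more component adds a single input column to the first layer of the second network, while every other weight is shared across components.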

Theorems & Definitions (16)

  • Theorem 4.1
  • proof
  • Corollary 4.2
  • Theorem 4.3
  • proof
  • Theorem 4.4
  • proof
  • Corollary 4.5
  • proof
  • proof
  • ...and 6 more