Mixed Variational Flows for Discrete Variables

Gian Carlo Diluvi, Benjamin Bloem-Reddy, Trevor Campbell

TL;DR

This work tackles variational inference for discrete distributions without continuous embeddings. It introduces MAD Mix, a variational family built from a measure-preserving and discrete (MAD) map that acts on the target augmented with uniform variables. MAD Mix enables i.i.d. sampling and exact density evaluation while leaving the target distribution invariant, and extends to joint discrete-continuous models through a combined map with discretized Hamiltonian dynamics. The authors provide theoretical guarantees (invertibility, the density of the pushforward, measure preservation) and report experiments in which MAD Mix yields high-fidelity discrete approximations with substantially faster training and more stable behavior than continuous-embedding flows, with sampling quality comparable to Gibbs sampling. The approach offers a practical, scalable alternative for discrete and mixed-variable Bayesian models, with density-based evaluation via the ELBO.

Abstract

Variational flows allow practitioners to learn complex continuous distributions, but approximating discrete distributions remains a challenge. Current methodologies typically embed the discrete target in a continuous space - usually via continuous relaxation or dequantization - and then apply a continuous flow. These approaches involve a surrogate target that may not capture the original discrete target, might have biased or unstable gradients, and can create a difficult optimization problem. In this work, we develop a variational flow family for discrete distributions without any continuous embedding. First, we develop a measure-preserving and discrete (MAD) invertible map that leaves the discrete target invariant, and then create a mixed variational flow (MAD Mix) based on that map. Our family provides access to i.i.d. sampling and density evaluation with virtually no tuning effort. We also develop an extension to MAD Mix that handles joint discrete and continuous models. Our experiments suggest that MAD Mix produces more reliable approximations than continuous-embedding flows while being significantly faster to train.
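
To make the i.i.d. sampling claim concrete, the following sketch assumes the usual mixed-flow construction, in which the approximation is a uniform mixture of the reference distribution pushed through 0, 1, ..., N-1 applications of a measure-preserving map. Everything named here is an illustrative assumption rather than the authors' implementation: the helper names `mix_flow_sample`, `rotate`, and `sample_ref`, the squared-uniform reference, and the golden-ratio shift. A shift modulo 1 on the unit interval, which preserves the uniform distribution, stands in for the MAD map.

```python
import numpy as np

def mix_flow_sample(sample_ref, step, n_mix, rng):
    """Draw one sample from the uniform mixture of the reference pushed
    through 0, 1, ..., n_mix - 1 applications of `step`."""
    n = rng.integers(n_mix)          # number of map applications, uniform on {0, ..., n_mix - 1}
    z = sample_ref(rng)              # draw from the reference distribution
    for _ in range(n):
        z = step(z)
    return z

# Stand-in map (assumption): an irrational shift modulo 1.
XI = (np.sqrt(5.0) - 1.0) / 2.0

def rotate(r):
    return (r + XI) % 1.0

# Deliberately non-uniform reference on [0, 1) with mean 1/3 (assumption).
def sample_ref(rng):
    return rng.random() ** 2

rng = np.random.default_rng(0)
samples = np.array(
    [mix_flow_sample(sample_ref, rotate, 500, rng) for _ in range(2000)]
)
```

Each draw is independent of the others, and no optimization is involved: the only choices are the reference, the map, and the number of mixture components. In this toy setting the mixture moves the skewed reference toward the uniform distribution that the shift preserves.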

Paper Structure

This paper contains 36 sections, 6 theorems, 62 equations, 3 figures, and 2 algorithms.

Key Result

Theorem 2.1

If $T:\mathcal{X}\to\mathcal{X}$ is measure-preserving and ergodic for $\pi$, then for any $f\in L^1(\pi)$
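
Assuming the usual pointwise form of Birkhoff's ergodic theorem, to which Theorem 2.1 is attributed, the conclusion is that time averages along the orbit of $T$ converge to the expectation under $\pi$:

$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)=\int f\,\mathrm{d}\pi \quad \text{for } \pi\text{-almost every } x\in\mathcal{X}.$$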

Figures (3)

  • Figure 1: One application of the MAD map to the initial values $(x_0,u_0)=(2,0.75)$ with target probabilities $\pi={\sf{Categorical}}(0.1,0.4,0.4,0.1)$. Each plot contains the CDF of $\pi$. In the first plot, $u_0$ represents the proportion of mass between $x_0=2$ and $1$. In the second plot, $u_0$ is transformed into $\rho$, which indicates where in the CDF of $\pi$ the initial value $x_0$ lies. Then, in the third plot, $\rho$ is shifted vertically by $\xi=0.45$ to $\tilde{\rho}$, which produces a new value $x_1$ via the inverse-CDF trick. Finally, $\tilde{\rho}$ is transformed into $u_1$, which represents the proportion of mass between $x_1=3$ and $2$.
  • Figure 2: Summary of experiments, split into examples where the normalizing constant $Z$ of the target density is known and examples where $Z$ is not tractable. The boxplots for continuous-embedding flows represent the search over different architecture settings. (Top row): KL divergence (known $Z$) and negative ELBO (intractable $Z$) from approximation to target distribution. Lower is better. (Bottom row): Compute time (seconds) to evaluate or estimate the density (known $Z$) or to generate a sample (intractable $Z$). The second set of boxplots for continuous-embedding flows shows the time to evaluate a subsequent density point after training. Missing values indicate either that the algorithm cannot be used for that task or that it was too computationally unstable to produce results, except for 1D mean-field VI, which produces exact results ($\mathrm{KL}=0$). Colors are shared across figures and $x$-axes across columns.
  • Figure 3: True and approximated PMFs of the examples with tractable normalizing constant. Concrete-relaxed flows are not shown in the 3D example since even the optimal approximation across the architecture search is very poor. For the Ising model, we treat each $x\in\mathcal{X}$ as an element of $\{0,1\}^M$ (i.e., a binary representation) and show the flattened PMF in ascending order of binary representation. The legend is shared across figures.
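
The single map application walked through in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration of the univariate case only, assuming states are labelled $1,\dots,K$; the function name `mad_step` is hypothetical, and the formula for $u_1$ is inferred from the caption's description of $u$ as a proportion of mass rather than taken from the paper's algorithm listing.

```python
import numpy as np

def mad_step(x, u, probs, xi):
    """One application of the univariate map described in Figure 1.

    x     : current state, an integer in {1, ..., K} (1-indexed, assumption)
    u     : auxiliary uniform variable in [0, 1)
    probs : target probabilities (pi(1), ..., pi(K))
    xi    : shift applied to the CDF position, modulo 1
    """
    probs = np.asarray(probs, dtype=float)
    # cdf[k] = P(X <= k) for k = 0, ..., K, with cdf[0] = 0
    cdf = np.concatenate(([0.0], np.cumsum(probs)))
    # Position of (x, u) inside the CDF of the target
    rho = cdf[x - 1] + u * probs[x - 1]
    # Vertical shift, wrapped around to stay in [0, 1)
    rho_new = (rho + xi) % 1.0
    # Inverse-CDF step: the state whose CDF interval contains rho_new
    x_new = int(np.searchsorted(cdf, rho_new, side="right"))
    x_new = min(x_new, len(probs))  # guard against floating-point round-off
    # Proportion of the new state's mass lying below rho_new
    u_new = (rho_new - cdf[x_new - 1]) / probs[x_new - 1]
    return x_new, u_new

probs = [0.1, 0.4, 0.4, 0.1]
x1, u1 = mad_step(2, 0.75, probs, 0.45)    # x1 == 3, u1 is approximately 0.875
x0, u0 = mad_step(x1, u1, probs, -0.45)    # shifting back recovers (2, 0.75)
```

With the caption's values, $\rho = 0.1 + 0.75\cdot 0.4 = 0.4$ and $\tilde{\rho} = 0.85$, which falls in the CDF interval of state 3, matching $x_1=3$. Applying the same step with $-\xi$ undoes it, which illustrates the invertibility property.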

Theorems & Definitions (6)

  • Theorem 2.1 (Ergodic Theorem; Birkhoff, 1931)
  • Proposition 3.1
  • Proposition 3.2
  • Lemma G.1
  • Lemma G.2
  • Proposition G.3