Kernel Semi-Implicit Variational Inference

Ziheng Cheng, Longlin Yu, Tianyu Xie, Shiyue Zhang, Cheng Zhang

TL;DR

Kernel Semi-Implicit Variational Inference (KSIVI) tackles intractable densities in semi-implicit variational families by deriving an explicit RKHS-based solution to the inner optimization, turning training into minimizing the kernel Stein discrepancy $\mathrm{KSD}(q_\phi \,\|\, p)$. By leveraging the hierarchical form $q_\phi(x)=\int q_\phi(x|z)\, q(z)\, dz$ and a kernel trick, KSIVI avoids inner-loop optimization while maintaining a computable objective and unbiased Monte Carlo gradient estimates. The authors prove a gradient variance bound and establish convergence to a stationary point under mild smoothness and moment assumptions, with extensive experiments on toy distributions, Bayesian logistic regression, conditioned diffusion processes, and Bayesian neural networks. Empirically, KSIVI achieves competitive or superior performance to SIVI-SM with improved stability and less hyperparameter tuning, underscoring its practical relevance for scalable Bayesian inference in complex models.

Abstract

Semi-implicit variational inference (SIVI) extends traditional variational families with semi-implicit distributions defined in a hierarchical manner. Due to the intractable densities of semi-implicit distributions, classical SIVI often resorts to surrogates of the evidence lower bound (ELBO) that introduce bias into training. A recent advancement in SIVI, named SIVI-SM, utilizes an alternative score matching objective made tractable via a minimax formulation, albeit requiring an additional lower-level optimization. In this paper, we propose kernel SIVI (KSIVI), a variant of SIVI-SM that eliminates the need for lower-level optimization through kernel tricks. Specifically, we show that when optimizing over a reproducing kernel Hilbert space (RKHS), the lower-level problem has an explicit solution. This way, the upper-level objective becomes the kernel Stein discrepancy (KSD), which is readily computable for stochastic gradient descent due to the hierarchical structure of semi-implicit variational distributions. An upper bound for the variance of the Monte Carlo gradient estimators of the KSD objective is derived, which allows us to establish novel convergence guarantees of KSIVI. We demonstrate the effectiveness and efficiency of KSIVI on both synthetic distributions and a variety of real-data Bayesian inference tasks.
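
To make the phrase "readily computable due to the hierarchical structure" concrete, here is a minimal sketch of a Monte Carlo estimate of the squared KSD that uses the conditional score $\nabla_x \log q_\phi(x|z)$ in place of the intractable marginal score. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian conditional layer, the RBF kernel, the two-batch estimator, and the names `sample_semi_implicit` and `ksd_squared_estimate` are all hypothetical choices made for this example.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def sample_semi_implicit(n, mu, sigma, dim, rng):
    """Draw n pairs (x, conditional score) from an assumed Gaussian family.

    Assumption for this sketch: z ~ N(0, I) and x | z ~ N(mu(z), sigma^2 I),
    so x = mu(z) + sigma * eps and grad_x log q(x | z) = -eps / sigma.
    """
    pairs = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x = [m + sigma * e for m, e in zip(mu(z), eps)]
        cond_score = [-e / sigma for e in eps]
        pairs.append((x, cond_score))
    return pairs


def ksd_squared_estimate(batch_a, batch_b, target_score, bandwidth=1.0):
    """Average k(x, y) * <d(x), d(y)> over two independent batches,
    where d(x) = target_score(x) - conditional score attached to x."""

    def score_diff(x, cond_score):
        return [s - c for s, c in zip(target_score(x), cond_score)]

    diffs_a = [(x, score_diff(x, c)) for x, c in batch_a]
    diffs_b = [(y, score_diff(y, c)) for y, c in batch_b]
    total = 0.0
    for x, dx in diffs_a:
        for y, dy in diffs_b:
            inner = sum(a * b for a, b in zip(dx, dy))
            total += rbf_kernel(x, y, bandwidth) * inner
    return total / (len(diffs_a) * len(diffs_b))


if __name__ == "__main__":
    rng = random.Random(0)

    def std_normal_score(x):
        # Score of the assumed target p = N(0, I): grad log p(x) = -x.
        return [-xi for xi in x]

    # Marginal N(0, 0.6^2 + 0.8^2) = N(0, 1) matches the target exactly.
    matched = [sample_semi_implicit(200, lambda z: [0.6 * zi for zi in z],
                                    0.8, 1, rng) for _ in range(2)]
    # Marginal N(3, 1.64) is far from the target.
    shifted = [sample_semi_implicit(200, lambda z: [zi + 3.0 for zi in z],
                                    0.8, 1, rng) for _ in range(2)]
    print("matched:", ksd_squared_estimate(*matched, std_normal_score))
    print("shifted:", ksd_squared_estimate(*shifted, std_normal_score))
```

Because the estimate averages over two independent batches, it is unbiased for the squared KSD under these assumptions, and it can be slightly negative when the variational distribution already matches the target. In a training loop, `mu` and `sigma` would carry the parameters $\phi$ and the estimate would be differentiated through the reparameterized samples.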

Paper Structure

This paper contains 37 sections, 11 theorems, 58 equations, 12 figures, 8 tables, 2 algorithms.

Key Result

Theorem 3.1

For any variational distribution $q_\phi$, the solution $f^*$ to the lower-level optimization in (eq:kernel_minmax) has an explicit closed form. Thus the upper-level optimization problem for $\phi$ reduces to minimizing the kernel Stein discrepancy $\mathrm{KSD}(q_\phi \,\|\, p)$.
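
One way to see why a KSD objective stays computable even though $q_\phi(x)$ is intractable is sketched below. This is a standard argument written in assumed notation (kernel $k$, target score $\nabla \log p$, and the score-difference form of the squared KSD under the usual regularity conditions); the paper's exact statement and normalization may differ.

```latex
\begin{align*}
\nabla_x \log q_\phi(x)
  &= \frac{\int \nabla_x q_\phi(x|z)\, q(z)\, dz}{q_\phi(x)}
   = \int \nabla_x \log q_\phi(x|z)\, \frac{q_\phi(x|z)\, q(z)}{q_\phi(x)}\, dz
   = \mathbb{E}_{q_\phi(z|x)}\!\left[\nabla_x \log q_\phi(x|z)\right], \\[4pt]
\mathrm{KSD}(q_\phi \,\|\, p)^2
  &= \mathbb{E}_{x, x' \sim q_\phi}\!\left[
       k(x, x')\,
       \big(\nabla \log p(x) - \nabla \log q_\phi(x)\big)^{\top}
       \big(\nabla \log p(x') - \nabla \log q_\phi(x')\big)
     \right] \\
  &= \mathbb{E}_{(x,z),\,(x',z')}\!\left[
       k(x, x')\,
       \big(\nabla \log p(x) - \nabla_x \log q_\phi(x|z)\big)^{\top}
       \big(\nabla \log p(x') - \nabla_{x'} \log q_\phi(x'|z')\big)
     \right].
\end{align*}
```

The last equality uses the first identity together with the independence of the pairs $(x, z)$ and $(x', z')$, each drawn as $z \sim q(z)$, $x \sim q_\phi(x|z)$. Every term on the final line involves only the conditional density, which is tractable by construction.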

Figures (12)

  • Figure 1: Performance of KSIVI on toy examples. The histograms in blue represent the estimated densities using 100,000 samples generated from KSIVI's variational approximation. The black lines depict the contours of the target distributions.
  • Figure 2: Convergence of KL divergence during training obtained by different methods on toy examples. The KL divergences are estimated using the Python ITE module (ITE2014) with 100,000 samples. The results are averaged over 5 independent runs, with the standard deviation shown as the shaded region.
  • Figure 3: Marginal and pairwise variational approximations of $\beta_2,\beta_3,\beta_4$ on the Bayesian logistic regression task. The contours of the pairwise posterior approximation produced by SIVI-SM (in orange), SIVI (in green), and KSIVI (in blue) are graphed in comparison to the ground truth (in black). The sample size is 1000.
  • Figure 4: Comparison between the estimated pairwise correlation coefficients and the ground truth on the Bayesian logistic regression task. Each scatter point plots the estimated correlation coefficient ($y$-axis) against the ground truth correlation coefficient ($x$-axis) of some pair $(\beta_i,\beta_j)$. The lines in the same color as the scatter points are the regression lines. The sample size is 1000.
  • Figure 5: Variational approximations of different methods for the discretized conditioned diffusion process. The magenta trajectory represents the ground truth obtained via parallel SGLD. The blue line shows the posterior mean estimated by each method, and the shaded region denotes the $95\%$ marginal posterior confidence interval at each time step. The sample size is 1000.
  • ...and 7 more figures

Theorems & Definitions (16)

  • Theorem 3.1
  • Proposition 3.2
  • Proposition 4.5
  • Theorem 4.6
  • Theorem 4.7
  • Theorem 4.8
  • Theorem 1.1
  • proof
  • Proposition 1.2
  • proof
  • ...and 6 more