A Trust-Region Method for Graphical Stein Variational Inference

Liam Pavlovic, David M. Rosen

TL;DR

This paper proposes a trust-region optimization approach for Stein variational inference (SVI). It extends prior SVI methods by exploiting conditional independences in the target distribution and second-order information, and it adds an adaptive step control procedure, which is essential for ensuring convergence on challenging non-convex optimization problems.

Abstract

Stein variational inference (SVI) is a sample-based approximate Bayesian inference technique that generates a sample set by jointly optimizing the samples' locations to minimize an information-theoretic measure of discrepancy with the target probability distribution. SVI thus provides a fast and significantly more sample-efficient approach to Bayesian inference than traditional (random-sampling-based) alternatives. However, the optimization techniques employed in existing SVI methods struggle to address problems in which the target distribution is high-dimensional, poorly conditioned, or non-convex, which severely limits the range of their practical applicability. In this paper, we propose a novel trust-region optimization approach for SVI that successfully addresses each of these challenges. Our method builds upon prior work in SVI by leveraging conditional independences in the target distribution (to achieve high-dimensional scaling) and second-order information (to address poor conditioning), while additionally providing an effective adaptive step control procedure, which is essential for ensuring convergence on challenging non-convex optimization problems. Experimental results show that our method achieves better numerical performance than previous SVI techniques, in both convergence rate and sample accuracy, and scales better to high-dimensional distributions.
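
To make "jointly optimizing the samples' locations" concrete, the following is a minimal sketch of the classic first-order Stein variational gradient descent (SVGD) direction with an RBF kernel. It illustrates the kind of baseline update that SVI methods build on and is not the trust-region method proposed in the paper. The names `rbf_kernel`, `svgd_direction`, `grad_log_p`, and the bandwidth `h` are assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(X, h):
    """RBF kernel matrix and its gradient with respect to the first argument.

    X has shape (n, d). Returns K with shape (n, n) and grad_K with shape
    (n, n, d), where grad_K[i, j] = d k(x_i, x_j) / d x_i.
    """
    diff = X[:, None, :] - X[None, :, :]            # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * h ** 2))
    grad_K = -diff / h ** 2 * K[:, :, None]
    return K, grad_K


def svgd_direction(X, grad_log_p, h=1.0):
    """Classic SVGD update direction for every particle (hypothetical helper).

    grad_log_p maps an (n, d) array of particles to the (n, d) array of
    gradients of the log target density at those particles.
    """
    n = X.shape[0]
    K, grad_K = rbf_kernel(X, h)
    G = grad_log_p(X)
    attract = K.T @ G               # sum_j k(x_j, x_i) * grad log p(x_j)
    repulse = grad_K.sum(axis=0)    # sum_j d k(x_j, x_i) / d x_j
    return (attract + repulse) / n


# Usage: move particles toward a standard normal target with a fixed step size.
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
for _ in range(500):
    particles = particles + 0.1 * svgd_direction(particles, lambda Z: -Z)
```

The first term pulls particles toward high-density regions, while the second pushes them apart so the set does not collapse onto a single mode. The fixed step size of 0.1 used here is exactly the kind of hand-tuned choice that the abstract's adaptive step control is meant to replace.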

Paper Structure

This paper contains 25 sections, 1 theorem, 34 equations, 5 figures, 2 tables, 3 algorithms.

Key Result

Theorem 1

Along a pair of directions $V, W \in \mathcal{H}_1 \times \dots \times \mathcal{H}_D$, the second variation is [equation not recovered]. The inner products here are between the functions $h_{ab}$, $w_b$, and $v_a$ in Hilbert spaces. $x$ and $y$ are free variables, included only to show which functions share which inputs.

Figures (5)

  • Figure 1: The convergence rate as a function of iteration number (a) and compute time (b) of each SVI method on the small SNLP instance. All second-order methods show fast, smooth convergence, whereas both MP-SVGD variants oscillate until their step size decays enough to enable convergence. Note that, unlike the other methods, SVN-CTR does not use local kernels to compute the gradient (see Eqs. \ref{phi-max} and \ref{mp-grad}). Since the estimated gradients depend upon the choice of kernel, SVN-CTR's gradient magnitude values are not directly comparable to those of the other methods.
  • Figure 2: Kernel density estimation (KDE) plots of the final samples produced by various variational inference methods on a high-dimensional, noisy SNLP problem. From each sample, the marginal samples corresponding to the location of a selected sensor are extracted and visualized as a KDE plot. Since ground truth was not recoverable, we also visualize the measurements received by each selected sensor to enable qualitative analysis. These measurements are displayed as orange circles with a radius equal to the range measurement centered on the true position of the sending node. The time to generate the sample (in seconds) is displayed under its name.
  • Figure 3: Graph representations of the sensor network localization problems used for evaluation with the small example on the left and the large example on the right. Estimated nodes are depicted in blue and anchors in orange. The edges represent shared range measurements between pairs of nodes. Blue edges correspond to measurements shared between two estimated nodes and orange edges correspond to measurements from an anchor.
  • Figure 4: The convergence rates of MP-SVGD (a) and SVN (b) on the small SNLP instance with a variety of static step sizes. None of the step sizes produces good results.
  • Figure 5: Kernel density estimation (KDE) plots of the final samples produced by various SVI methods and the dynesty reference on a low-dimensional SNLP problem. From each sample, the marginal samples corresponding to the location of the selected sensor are extracted and visualized as a KDE plot. The KDE plots of the different methods are displayed on the same scale with the exception of SVN-CTR's plot for sensor A, which required a scale an order of magnitude smaller to be visible. The time required to generate each sample is displayed under its name.
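
Figures 1 and 4 contrast static step sizes with adaptive step control. As a generic illustration of the latter, the sketch below implements a textbook trust-region loop that grows or shrinks its radius according to how well a quadratic model predicted the actual decrease in the objective. It uses the simple Cauchy-point step on an ordinary non-convex function and is not the paper's SVI algorithm. The function names, the thresholds 0.25 and 0.75, and the test objective `double_well` are assumptions chosen for illustration.

```python
import numpy as np


def cauchy_step(g, H, radius):
    """Minimizer of the quadratic model along -g, restricted to the trust region."""
    g_norm = np.linalg.norm(g)
    curvature = g @ H @ g
    if curvature <= 0.0:
        tau = 1.0                                   # negative curvature: go to the boundary
    else:
        tau = min(1.0, g_norm ** 3 / (radius * curvature))
    return -tau * radius / g_norm * g


def trust_region_minimize(f, grad, hess, x0, radius=1.0, max_radius=100.0,
                          accept=0.1, tol=1e-6, max_iter=200):
    """Textbook trust-region loop with adaptive radius (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        step = cauchy_step(g, H, radius)
        predicted = -(g @ step + 0.5 * step @ H @ step)   # model decrease, > 0
        actual = f(x) - f(x + step)                       # true decrease
        rho = actual / predicted
        if rho < 0.25:
            radius *= 0.25                                # model was poor: shrink
        elif rho > 0.75 and np.isclose(np.linalg.norm(step), radius):
            radius = min(2.0 * radius, max_radius)        # model was good: expand
        if rho > accept:
            x = x + step                                  # accept only useful steps
    return x


# Hypothetical non-convex test objective with minima at (+1, 0) and (-1, 0).
def double_well(x):
    return (x[0] ** 2 - 1.0) ** 2 + x[1] ** 2


def double_well_grad(x):
    return np.array([4.0 * x[0] * (x[0] ** 2 - 1.0), 2.0 * x[1]])


def double_well_hess(x):
    return np.diag([12.0 * x[0] ** 2 - 4.0, 2.0])


x_min = trust_region_minimize(double_well, double_well_grad, double_well_hess,
                              [0.1, 1.0])
```

Because a step is accepted only when the objective actually decreases, an overly ambitious step is rejected and retried with a smaller radius rather than causing the oscillation that a fixed step size can produce.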

Theorems & Definitions (1)

  • Theorem 1