General bounds on the quality of Bayesian coresets

Trevor Campbell

TL;DR

General upper and lower bounds on the Kullback-Leibler (KL) divergence of coreset approximations are presented to obtain fundamental limitations on the quality of coreset approximations, and to provide a theoretical explanation for the previously observed poor empirical performance of importance sampling-based construction methods.

Abstract

Bayesian coresets speed up posterior inference in the large-scale data regime by approximating the full-data log-likelihood function with a surrogate log-likelihood based on a small, weighted subset of the data. But while Bayesian coresets and methods for their construction are applicable in a wide range of models, existing theoretical analyses of the posterior inferential error incurred by coreset approximations apply only in restrictive settings -- i.e., exponential family models, or models with strong log-concavity and smoothness assumptions. This work presents general upper and lower bounds on the Kullback-Leibler (KL) divergence of coreset approximations that reflect the full range of applicability of Bayesian coresets. The lower bounds require only mild model assumptions typical of Bayesian asymptotic analyses, while the upper bounds require the log-likelihood functions to satisfy a generalized subexponentiality criterion that is weaker than conditions used in earlier work. The lower bounds are applied to obtain fundamental limitations on the quality of coreset approximations, and to provide a theoretical explanation for the previously observed poor empirical performance of importance sampling-based construction methods. The upper bounds are used to analyze the performance of recent subsample-optimize methods. The flexibility of the theory is demonstrated in validation experiments involving multimodal, unidentifiable, heavy-tailed Bayesian posterior distributions.
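To make the idea of a surrogate log-likelihood built from a small, weighted subset concrete, the following is a minimal sketch of an importance-weighted coreset for a Cauchy location model. It is an illustration under stated assumptions and not the paper's code: the function names (`cauchy_loglik`, `importance_weighted_coreset`, `coreset_loglik`), the unit scale, the uniform sampling probabilities, and the sizes `N` and `M` are all hypothetical choices. The weights use the usual importance-sampling form, the number of times a point is drawn divided by `M * p_n`, so that each weight has expectation 1.

```python
import numpy as np

def cauchy_loglik(theta, x):
    # Log density of a Cauchy(location=theta, scale=1) at each observation in x.
    # The unit scale is an assumption made for this illustration.
    return -np.log(np.pi) - np.log1p((x - theta) ** 2)

def importance_weighted_coreset(p, M, rng):
    # Draw M indices i.i.d. from the sampling probabilities p and weight each
    # selected point by count / (M * p_n); unselected points get weight 0.
    counts = rng.multinomial(M, p)
    return counts / (M * p)

def coreset_loglik(theta, x, w):
    # Surrogate log-likelihood: weighted sum over points with nonzero weight.
    idx = np.nonzero(w)[0]
    return float(np.sum(w[idx] * cauchy_loglik(theta, x[idx])))

rng = np.random.default_rng(0)
N, M = 1000, 50                      # hypothetical dataset and coreset sizes
x = rng.standard_cauchy(N)           # synthetic data for illustration only
p = np.full(N, 1.0 / N)              # uniform sampling probabilities (assumption)
w = importance_weighted_coreset(p, M, rng)

theta = 0.3
full = float(np.sum(cauchy_loglik(theta, x)))   # full-data log-likelihood
approx = coreset_loglik(theta, x, w)            # coreset surrogate
print(f"full: {full:.2f}, coreset: {approx:.2f}, points used: {np.count_nonzero(w)}")
```

The surrogate touches at most `M` data points per evaluation, which is where the computational saving comes from; how close it is to the full log-likelihood in terms of posterior KL divergence is the question the bounds address.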

Paper Structure (9 sections, 8 theorems, 25 equations, 3 figures, 3 algorithms)

This paper contains 9 sections, 8 theorems, 25 equations, 3 figures, 3 algorithms.

Introduction
Background
Lower bounds on approximation error
Lower bound applications
Minimum coreset size for importance-weighted coresets
Minimum coreset size for any coreset construction
Upper bound application: subsample-optimize coresets
Conclusions
Proofs

Key Result

Lemma 3.1

For all measurable $B\subseteq \Theta$ and coreset weights $w$, where

Figures (3)

Figure 1: Example unnormalized posterior densities given 50 data points for (\ref{fig:cauchymodel}) the Cauchy location model and (\ref{fig:logregmodel}) the logistic regression model. The orange and blue dashed lines in (\ref{fig:logregmodel}) indicate one-dimensional slices that are shown in the rightmost panels.
Figure 2: Importance-weighted coreset quality, showing the minimum of the forward and reverse KL divergences on the vertical axis as a function of dataset size $N$ for 3 coreset sizes: $\log N$ (black), $\sqrt{N}$ (blue), and $\frac{1}{2}N$ (red). Dashed lines indicate predictions from the theory in \ref{cor:importanceweighted} and \ref{cor:scaledimportanceweighted}, solid lines indicate the mean over 10 trials, and error bars indicate standard error. The top row shows the quality of basic importance-weighted coresets (note that both horizontal and vertical axes are in log scale), while the bottom row shows the quality with optimal post-hoc scaling (note that only the horizontal axis is in log scale). The left column corresponds to the Cauchy location model, while the right column corresponds to the logistic regression model. Sampling probabilities $p_n$ for both models are set proportional to $X_n^2$, thresholded to lie between $0.1/N$ and $10/N$.
Figure 3: Subsample-optimize coreset quality, showing the maximum of the forward and reverse KL divergences on the vertical axis as a function of dataset size $N$ for coresets of size $5+2\log N$. Solid lines indicate the mean over 70 trials, and error bars indicate standard error. The left panel is for the Cauchy location model, while the right panel is for the logistic regression model. Sampling probabilities are uniform $p_n=1/N$, and coreset weights were optimized by nonnegative least squares for log-likelihoods discretized via samples from $\pi$ \cite{Campbell19b}.
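The subsample-optimize setup described in the Figure 3 caption can be sketched roughly as follows: pick a uniform subsample of size about $5+2\log N$, evaluate the log-likelihoods at a finite set of parameter draws, and fit nonnegative weights by least squares. This is a simplified illustration and not the experiment's actual code. Several pieces are assumptions: the draws `thetas` come from a hypothetical Gaussian stand-in rather than from $\pi$, the subsample is taken without replacement, the coreset size is rounded up, and any centering or scaling of the log-likelihoods used in the cited method is omitted.

```python
import numpy as np
from scipy.optimize import nnls

def cauchy_loglik_matrix(thetas, x):
    # Entry (s, n) is the unit-scale Cauchy log density of x[n] at location thetas[s].
    return -np.log(np.pi) - np.log1p((x[None, :] - thetas[:, None]) ** 2)

rng = np.random.default_rng(1)
N = 500                                   # hypothetical dataset size
x = rng.standard_cauchy(N)                # synthetic data for illustration only

# Coreset size 5 + 2 log N from the caption, rounded up (rounding is an assumption).
M = int(np.ceil(5 + 2 * np.log(N)))
idx = rng.choice(N, size=M, replace=False)   # uniform subsample, p_n = 1/N

# Stand-in parameter draws used to discretize the log-likelihood functions.
S = 200
thetas = np.median(x) + 0.2 * rng.standard_normal(S)

F = cauchy_loglik_matrix(thetas, x)       # shape (S, N)
b = F.sum(axis=1)                         # full-data log-likelihood at each draw
A = F[:, idx]                             # subsampled log-likelihoods, shape (S, M)

# Nonnegative least squares: minimize ||A w - b||_2 subject to w >= 0.
w_sub, resid = nnls(A, b)

w = np.zeros(N)
w[idx] = w_sub                            # coreset weights, zero off the subsample
print(f"coreset size: {M}, nonzero weights: {np.count_nonzero(w)}, residual: {resid:.3f}")
```

Only the weights on the subsampled points are optimized, so the coreset never grows beyond the initial subsample; the residual gives a rough sense of how well the weighted subset reproduces the full log-likelihood at the chosen draws.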

Theorems & Definitions (8)

Lemma 3.1: Basic KL Lower Bound
Theorem 3.3
Theorem 3.5
Corollary 3.6
Corollary 4.1
Corollary 4.2
Corollary 4.3
Corollary 5.1
