Scalable Kernel-Based Distances for Statistical Inference and Integration

Masha Naslidnyk

TL;DR

This thesis conducts a thorough study of kernel-based distances with a focus on efficient computation, and introduces a family of novel kernel-based discrepancies, kernel quantile discrepancies, that address some of the pitfalls of MMD.

Abstract

Representing, comparing, and measuring the distance between probability distributions is a key task in computational statistics and machine learning. The choice of representation and the associated distance determine properties of the methods in which they are used: for example, certain distances can allow one to encode robustness or smoothness of the problem. Kernel methods offer flexible and rich Hilbert space representations of distributions that allow the modeller to enforce properties through the choice of kernel, and estimate associated distances at efficient nonparametric rates. In particular, the maximum mean discrepancy (MMD), a kernel-based distance constructed by comparing Hilbert space mean functions, has received significant attention due to its computational tractability and is favoured by practitioners. In this thesis, we conduct a thorough study of kernel-based distances with a focus on efficient computation, with core contributions in Chapters 3 to 6. Part I of the thesis is focused on the MMD, specifically on improved MMD estimation. In Chapter 3 we propose a theoretically sound, improved estimator for MMD in simulation-based inference. Then, in Chapter 4, we propose an MMD-based estimator for conditional expectations, a ubiquitous task in statistical computation. Closing Part I, in Chapter 5 we study the problem of calibration when MMD is applied to the task of integration. In Part II, motivated by the recent developments in kernel embeddings beyond the mean, we introduce a family of novel kernel-based discrepancies: kernel quantile discrepancies. These address some of the pitfalls of MMD, and are shown through both theoretical results and an empirical study to offer a competitive alternative to MMD and its fast approximations. We conclude with a discussion on broader lessons and future work emerging from the thesis.
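
To make the idea of comparing Hilbert space mean functions concrete, here is a minimal sketch of the classical biased (V-statistic) estimate of MMD$^2$ from two equally weighted samples. The Gaussian kernel, the default lengthscale, and the function names are assumptions chosen for illustration; this is not the thesis's code and does not reproduce the improved estimators it proposes.

```python
import math


def gaussian_kernel(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats.

    The kernel choice and lengthscale are illustrative assumptions.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def mmd2_v_statistic(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of MMD^2 between samples xs and ys.

    Expands the squared RKHS distance between the two empirical mean
    embeddings into three averages of kernel evaluations.
    """
    n, m = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (n * n)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (m * m)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical toy data: two small one-dimensional samples.
sample_p = [(0.0,), (1.0,)]
sample_q = [(3.0,), (4.0,)]
print(mmd2_v_statistic(sample_p, sample_q))
```

The double sums make the cost quadratic in the sample sizes, which is one reason the abstract stresses efficient computation.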

Paper Structure (197 sections, 50 theorems, 417 equations, 26 figures, 7 tables, 1 algorithm)

Introduction
From tests to measures of discrepancy
The MMD and its applications
MMD for integration: kernel and Bayesian quadrature
Challenges and contributions
Challenge 1: Specialising MMD-based methods.
Challenge 2: Alternative kernel-based distances.
Background
Discrepancy vs. divergence vs. distance.
Kernels and reproducing kernel Hilbert spaces
Examples and basic properties of kernels
Matérn kernels
Polynomial kernels
The Brownian motion kernel
Reproducing kernel Hilbert spaces
...and 182 more sections

Key Result

Lemma 1

Any Sobolev kernel $k$ is strictly positive definite.

Figures (26)

Figure 1: Estimating the MMD requires approximating the embedding $\mu_{k,\mathbb{P}_\theta}$ of the model $\mathbb{P}_\theta$ in the RKHS $\mathcal{H}_k$. The classical approach approximates it using $N$ equally-weighted i.i.d. samples from $\mathbb{P}_\theta$, denoted $\mu_{k,\mathbb{P}_{\theta,N}}$. We show that this estimator can be improved by using optimally-weighted samples, denoted $\mu_{k,\mathbb{P}^w_{\theta,N}}$.
Figure 2: Error in estimating MMD$^2$ for the multivariate g-and-k distribution. (a) Error of our OW estimator for different choices of $k$ and $c$. Increasing the smoothness of $k$ improves the performance. (b) Comparison of V-statistic and OW estimator as a function of dimension. OW performs better for both parameterisations of $\mathbb{U}$, with the Gaussian giving lowest error. (c) Value of $\theta_4$ also impacts the performance of the OW estimator. (d) Error vs. total computation cost for different $M$. OW performs better than V-statistic for similar cost: $N=M$ for V-statistic, whereas $N = (68, 126, 200, 317)$ for OW.
Figure 3: ABC posteriors for the wind farm model. Our OW estimator yields posterior samples that are more concentrated around the true $\theta_0$ than the V-statistic. Settings: $M=100$, $\theta_0=20$.
Figure 4: Illustration of conditional Bayesian quadrature (CBQ). Stage one fits a GP to $f(x,\theta_t)$ for each $\theta_t \in \{\theta_1,\dots,\theta_T\}$ and integrates to obtain BQ estimates $I_{\mathrm{BQ}}(\theta_1),\dots, I_{\mathrm{BQ}}(\theta_T)$. Stage two places a GP over $\theta\mapsto I(\theta)$ and fuses these to yield $I_{\mathrm{CBQ}}(\theta)$ with posterior uncertainty shown by shaded regions.
Figure 5: Bayesian sensitivity analysis for linear models. Left: RMSE of all methods when $d=2$ and $N=50$. Middle: RMSE of all methods when $d=2$ and $T=50$. Right: RMSE of all methods when $N=T=100$.
...and 21 more figures

Theorems & Definitions (126)

Definition 1: Positive definite kernel
Remark 1
Definition 2
Definition 3
Remark 2: Equivalent definitions of an RKHS
Definition 4: Reproducing kernel Hilbert spaces
Remark 3
Definition 5: Kernel mean embedding
Definition 6: (Mean-)characteristic kernel
Definition 7: Maximum Mean Discrepancy
...and 116 more
