On the Robustness of Kernel Goodness-of-Fit Tests

Xing Liu, François-Xavier Briol

TL;DR

The paper addresses the lack of robustness in kernel goodness-of-fit (GOF) testing by introducing a robust test built on kernel Stein discrepancy (KSD) balls. It shows that standard KSD tests with stationary kernels are not robust, that tilted kernels can achieve qualitative robustness but not quantitative robustness, and then proposes a robust KSD test that controls the Type I error for all distributions inside a KSD ball around the reference model, with practical guidance for choosing the radius. The work provides theoretical guarantees (calibration and consistency) and empirical evidence across synthetic and real-model settings (RBMs, KEF, multimodal models), demonstrating improved robustness to outliers and tail misspecification while maintaining reasonable power. The test applies to unnormalized models and complex densities, and is practical to run thanks to bootstrap-based thresholds and the radius-selection guidance.

Abstract

Goodness-of-fit testing is often criticized for its lack of practical relevance: since "all models are wrong", the null hypothesis that the data conform to our model is ultimately always rejected as the sample size grows. Despite this, probabilistic models are still used extensively, raising the more pertinent question of whether the model is "good enough" for the task at hand. This question can be formalized as a robust goodness-of-fit testing problem by asking whether the data were generated from a distribution that is a mild perturbation of the model. In this paper, we show that existing kernel goodness-of-fit tests are not robust under common notions of robustness including both qualitative and quantitative robustness. We further show that robustification techniques using tilted kernels, while effective in the parameter estimation literature, are not sufficient to ensure both types of robustness in the testing setting. To address this, we propose the first robust kernel goodness-of-fit test, which resolves this open problem by using kernel Stein discrepancy (KSD) balls. This framework encompasses many well-known perturbation models, such as Huber's contamination and density-band models.
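
To make the non-robustness claim concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It takes $P = \mathcal{N}(0,1)$, a one-dimensional IMQ kernel $k(x, x') = (1 + (x - x')^2/\lambda^2)^{-1/2}$ with $\lambda = 1$ (this parameterization is an assumption), and the standard squared-KSD V-statistic. Huber's contamination is imitated by overwriting a fixed fraction $\epsilon$ of a clean sample with a single outlier value $z$. All function names are hypothetical.

```python
import numpy as np


def stein_kernel_imq(x, y, lam=1.0):
    """Stein kernel u_p(x, y) for P = N(0, 1) and a 1-D IMQ base kernel.

    Assumed base kernel: k(x, y) = (1 + (x - y)^2 / lam^2)^(-1/2).
    Score of N(0, 1): s_p(x) = -x.
    """
    r = x - y
    base = 1.0 + r**2 / lam**2
    k = base**-0.5
    dk_dx = -(r / lam**2) * base**-1.5          # derivative of k in x
    dk_dy = -dk_dx                              # derivative of k in y
    d2k = base**-1.5 / lam**2 - 3.0 * r**2 / lam**4 * base**-2.5  # mixed derivative
    sx, sy = -x, -y
    return sx * sy * k + sx * dk_dy + sy * dk_dx + d2k


def ksd_v_statistic(sample, lam=1.0):
    """Squared KSD V-statistic: average of u_p over all pairs, including i = j."""
    x = np.asarray(sample, dtype=float)
    return stein_kernel_imq(x[:, None], x[None, :], lam).mean()


def huber_contaminate(sample, eps, z):
    """Overwrite a fraction eps of the sample with the outlier value z.

    This is a fixed-proportion stand-in for sampling from (1 - eps) P + eps delta_z.
    """
    out = np.array(sample, dtype=float)
    n_out = int(round(eps * len(out)))
    out[:n_out] = z
    return out


rng = np.random.default_rng(0)
clean = rng.standard_normal(200)
v_clean = ksd_v_statistic(clean)
v_by_z = {
    z: ksd_v_statistic(huber_contaminate(clean, eps=0.05, z=z))
    for z in (5.0, 10.0)
}
print("clean:", v_clean)
for z, v in v_by_z.items():
    print("z =", z, "->", v)
```

With a stationary kernel, the diagonal term $u_p(z, z) = z^2 + 1/\lambda^2$ grows without bound in $z$, so even a small fraction of far-away outliers inflates the statistic. This is the behaviour that the robust test is designed to tolerate.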

Paper Structure

This paper contains 57 sections, 24 theorems, 152 equations, 15 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Assume $\mathbb{E}_{\mathbf{X} \sim P}[ \| \mathbf{s}_p(\mathbf{X}) \|_2^4 ] < \infty$, the function $\mathbf{x} \mapsto \| \mathbf{s}_p(\mathbf{x}) \|_2$ is unbounded, $k(\mathbf{x}, \mathbf{x}') = h(\mathbf{x} - \mathbf{x}')$ with $h \in \mathcal{C}_b^2$ and $h(0) > 0$, and assume the integrability [...] Then, for any test level $\alpha \in (0, 1)$ and any sequence $\{\epsilon_n\}_{n=1}^\infty$ with [...]

Figures (15)

  • Figure 1: Left. Stein kernel for $P=\mathcal{N}(0,1)$ and an IMQ kernel tilted by $w(x) = (1 + x^2)^{-b}$. The larger $b$ is, the more the tails of the function $x \mapsto u_p(x,x)$ are down-weighted. The choice $b=0$ corresponds to no weighting, reducing to an IMQ kernel. Right. The rejection probability under contamination by $R=\delta_z$ with $z = 10$.
  • Figure 2: Rejection probability of robust-KSD with different bandwidths $\lambda$. "med" is the median heuristic. "KSDAgg" is the test of Schrab et al. (2022). The dashed line is $\alpha = 0.05$. The vertical line is the maximal proportion of contamination $\epsilon_0 = 0.05$ controlled by robust-KSD.
  • Figure 3: Rejection probability under an outlier-contaminated Gaussian model with different outlier values $z$ and contamination ratios $\epsilon$. The grey dotted horizontal line is the test level $\alpha = 0.05$, and the black dash-dot vertical line corresponds to $\epsilon_0 = 0.05$. The KSD tests with IMQ kernel lack both qualitative and quantitative robustness since they reject even for small $z$ or $\epsilon$. The tilted-KSD test is more robust when $z$ or $\epsilon$ is larger, but ultimately still rejects the null due to its lack of quantitative robustness.
  • Figure 4: Heavy-tailed experiment. Left. Log densities of a standard Gaussian model and scaled t-distributions with different degrees of freedom (dof), moment-matched to the Gaussian. Right. Rejection probability of the standard and robust tests with different $\theta$ set to control the cases $\nu \geq \nu_0$ for different values of $\nu_0$.
  • Figure 5: Gaussian-Bernoulli RBM experiment. Left. Data generated from $P$ and injected contamination in the first two dimensions. Middle. Log unnormalized densities of the data and contamination ordered from small to large; the injected contamination is indeed abnormal since it has much lower density. Right. Probability of rejection with a Gaussian-Bernoulli RBM against the contamination ratio $\epsilon$. The robust tests are calibrated to control the Type-I error for contamination proportions up to $\epsilon_0 = 0.1$.
  • ...and 10 more figures
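
The down-weighting described in the Figure 1 caption can be reproduced with a short sketch. It assumes the tilted kernel has the product form $w(x)\,k(x, x')\,w(x')$ with an IMQ base kernel of bandwidth $\lambda = 1$, and $P = \mathcal{N}(0,1)$ so that $s_p(x) = -x$. On the diagonal the IMQ kernel equals 1, its first derivatives vanish, and its mixed second derivative is $1/\lambda^2$, which gives $u_p(x, x) = (s_p(x)\,w(x) + w'(x))^2 + w(x)^2/\lambda^2$. The function name and the closed form are worked out here for illustration and are not taken from the paper.

```python
import numpy as np


def tilted_stein_diag(x, b, lam=1.0):
    """Diagonal u_p(x, x) of the Stein kernel for P = N(0, 1).

    Assumes the tilted kernel w(x) k(x, x') w(x') with an IMQ base kernel k
    and weight w(x) = (1 + x^2)^(-b); b = 0 recovers the untilted IMQ kernel.
    """
    x = np.asarray(x, dtype=float)
    w = (1.0 + x**2) ** (-b)
    dw = -2.0 * b * x * (1.0 + x**2) ** (-b - 1.0)  # derivative of w
    score = -x                                      # score of N(0, 1)
    return (score * w + dw) ** 2 + w**2 / lam**2


grid = np.linspace(-10.0, 10.0, 2001)
max_by_b = {b: float(tilted_stein_diag(grid, b).max()) for b in (0.0, 0.5, 1.0)}
tail_by_b = {b: float(tilted_stein_diag(10.0, b)) for b in (0.0, 0.5, 1.0)}
for b in max_by_b:
    print("b =", b, "max on grid:", max_by_b[b], "value at x = 10:", tail_by_b[b])
```

With $b = 0$ the diagonal equals $x^2 + 1/\lambda^2$ and is unbounded, whereas $b \geq 1/2$ keeps it bounded because $x\,w(x)$ no longer grows with $|x|$.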

Theorems & Definitions (38)

  • Definition 1: Qualitative robustness to a sequence of neighborhoods
  • Definition 2: Quantitative robustness to a single neighborhood
  • Theorem 1
  • Remark 1: Unbounded Stein kernel
  • Remark 2: Moment conditions
  • Remark 3: Connection to separation boundaries
  • Lemma 2: Bounded Stein kernel
  • Theorem 3
  • Remark 4
  • Remark 5
  • ...and 28 more