Asymptotically well-calibrated Bayesian $p$-value using the Kolmogorov-Smirnov statistic
Yueming Shen, Surya Tokdar
TL;DR
This work addresses the conservatism and potential non-uniformity of the posterior predictive $p$-value (ppp) in Bayesian model checking. It develops a general criterion for asymptotic well-calibration and proves that Kolmogorov-Smirnov (KS)-type statistics, including the classical KS statistic (CKS) and its regression-generalized form (GKS), satisfy this criterion under contiguous alternatives. The authors show that ppp(CKS) and ppp(GKS) are asymptotically Uniform$(0,1)$ and demonstrate their finite-sample reliability and power through Gamma model and Gamma GLM simulations, including comparisons with chi-squared and score tests. The results provide robust, omnibus Bayesian model-checking tools that integrate naturally with posterior predictive workflows and extend calibration theory beyond asymptotically normal statistics. They apply in practice to common regression models such as Gamma GLMs.
Abstract
The posterior predictive $p$-value (ppp) is widely used in Bayesian model evaluation. However, due to double use of the data, the ppp may not be a valid $p$-value even in large samples: the asymptotic null distribution of the ppp can be non-uniform unless the underlying test statistic satisfies certain well-calibration conditions. Such conditions have been studied in the literature for asymptotically normal test statistics. We extend this line of work by establishing well-calibration conditions for test statistics that are not necessarily asymptotically normal. In particular, we show that Kolmogorov-Smirnov (KS)-type test statistics satisfy these conditions, so that their ppps are asymptotically well-calibrated Bayesian $p$-values. KS-type statistics are versatile, omnibus, and sensitive to model misspecification. They apply to i.i.d. real-valued data, as well as to non-identically distributed observations under regression models. Numerical experiments demonstrate that such $p$-values are well behaved in finite samples and can effectively detect a wide range of alternative models.
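To make the idea of a KS-based ppp concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy i.i.d. Exponential(rate) model with a conjugate Gamma prior, chosen only so that posterior draws are available in closed form. It also assumes one common construction of the ppp, in which the KS distance between the empirical CDF and the model CDF is evaluated at each posterior draw for both the observed and the replicated data. The paper's exact definition of the statistic may differ. The names `ks_distance` and `ppp_ks`, and the parameters `prior_shape`, `prior_rate`, `n_draws`, and `seed`, are hypothetical.

```python
import math
import random


def ks_distance(y, rate):
    """KS distance sup_t |F_n(t) - F(t)| for an Exponential(rate) model CDF F."""
    ys = sorted(y)
    n = len(ys)
    d = 0.0
    for i, v in enumerate(ys):
        cdf = 1.0 - math.exp(-rate * v)
        # The empirical CDF jumps from i/n to (i+1)/n at the i-th order statistic,
        # so the supremum is attained just before or at one of these jumps.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def ppp_ks(y, prior_shape=1.0, prior_rate=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of a posterior predictive p-value with a KS discrepancy.

    Assumed toy setup (not from the paper): y_i ~ Exponential(rate),
    rate ~ Gamma(prior_shape, prior_rate), which gives a Gamma posterior.
    """
    rng = random.Random(seed)
    n = len(y)
    post_shape = prior_shape + n
    post_rate = prior_rate + sum(y)
    exceed = 0
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale), so the scale is 1 / post_rate.
        rate = rng.gammavariate(post_shape, 1.0 / post_rate)
        # Replicated data set of the same size, drawn from the model at this rate.
        y_rep = [rng.expovariate(rate) for _ in range(n)]
        if ks_distance(y_rep, rate) >= ks_distance(y, rate):
            exceed += 1
    return exceed / n_draws
```

As a usage example, `ppp_ks(data)` on a list of positive observations returns a value in $[0, 1]$; small values indicate that the observed data sit further from the fitted model, in KS distance, than replicated data typically do. Under a well-calibrated statistic, repeating this over many data sets drawn from the assumed model should give ppp values that are roughly uniform in large samples.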
