Berry-Esseen theorems for the asymptotic normality of incomplete U-statistics with Bernoulli sampling
Dennis Leung
TL;DR
This work derives Berry-Esseen bounds for the incomplete U-statistic with Bernoulli sampling, $U'_{n,N}$, across three natural regimes determined by the size of the computational budget $N$ relative to the raw sample size $n$. The author combines a variable censoring technique with Stein's method for Studentized nonlinear statistics and an exponential lower tail bound for U-statistics with non-negative kernels to obtain Berry-Esseen bounds that match the natural CLT rates in each regime. The results clarify how the complete U-statistic component and the Bernoulli-sampling remainder jointly govern the accuracy of the normal approximation, with explicit error terms that scale with moments like $\mathbb{E}[|g|^3]$, $U_{h^2}$, and the budget ratio. These bounds are directly relevant to uncertainty quantification in ensemble methods that rely on subsampled kernel evaluations, and the methodology paves the way for extensions to random kernels and Studentized incomplete U-statistics.
Abstract
There has been a resurgence of interest in incomplete U-statistics that only sum over a subset of kernel evaluations, due to their computational efficiency and their asymptotic normality, which can be leveraged to quantify the uncertainty of ensemble predictions in machine learning. In this paper, we study the weak convergence to normality of one such construction, the incomplete U-statistic with Bernoulli sampling, under three different regimes on the relative sizes of the raw sample and the computational budget. Under minimal moment assumptions, we establish accompanying Berry-Esseen bounds with the natural rates that characterize the accuracy of these normal approximations. The key ingredients in our proofs include a variable censoring technique and a methodology for establishing Berry-Esseen bounds for the so-called Studentized nonlinear statistics recently formalized in the Stein's method literature, as well as an exponential lower tail bound for U-statistics with non-negative kernels.
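To make the construction concrete, the following is a minimal sketch of an incomplete U-statistic with Bernoulli sampling. It assumes one common formulation, which this page does not spell out: each of the $\binom{n}{m}$ index tuples is kept independently with probability $p = N / \binom{n}{m}$, and the selected kernel evaluations are averaged over the realized number of kept tuples. The paper's exact normalization may differ. The function name `incomplete_u_statistic`, its parameters, and the example kernel `variance_kernel` are hypothetical and chosen only for illustration.

```python
import itertools
import math
import random


def incomplete_u_statistic(sample, kernel, order, budget, rng):
    """Illustrative Bernoulli-sampled incomplete U-statistic.

    Assumption (not taken from the paper): every size-`order` index
    tuple is kept independently with probability p = budget / C(n, order),
    and the kept kernel values are averaged over the realized count.
    """
    n = len(sample)
    total = math.comb(n, order)          # number of possible tuples
    p = min(1.0, budget / total)         # Bernoulli selection probability
    kernel_sum = 0.0
    selected = 0
    for idx in itertools.combinations(range(n), order):
        if rng.random() < p:             # independent coin flip per tuple
            kernel_sum += kernel(*(sample[i] for i in idx))
            selected += 1
    if selected == 0:
        raise ValueError("no tuple was selected; increase the budget")
    return kernel_sum / selected, selected


def variance_kernel(x, y):
    """Order-2 kernel whose complete U-statistic is the sample variance."""
    return 0.5 * (x - y) ** 2


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
estimate, n_hat = incomplete_u_statistic(data, variance_kernel, 2, 2000, rng)
print(f"estimate={estimate:.4f}, tuples used={n_hat}")
```

This sketch enumerates all tuples for clarity, so it does not deliver the computational savings that motivate incomplete U-statistics; a practical implementation would draw the selected tuples without visiting every combination. When the budget is at least $\binom{n}{m}$, every tuple is kept and the estimate reduces to the complete U-statistic.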
