Berry-Esseen theorems for the asymptotic normality of incomplete U-statistics with Bernoulli sampling

Dennis Leung

TL;DR

This work derives Berry-Esseen bounds for the incomplete U-statistic with Bernoulli sampling, $U'_{n,N}$, across three natural regimes determined by the size of the computational budget $N$ relative to the raw sample size $n$. The proofs combine a variable censoring technique with Stein's method for nonlinear (Studentized) statistics and an exponential lower tail bound for U-statistics with non-negative kernels, yielding Berry-Esseen bounds that match the natural CLT rates in each regime. The results clarify how the complete U-statistic component and the Bernoulli-sampling remainder jointly govern the accuracy of the normal approximation, with explicit error terms that scale with quantities such as $\mathbb{E}[|g|^3]$, $U_{h^2}$, and the budget ratio. These bounds are directly relevant to uncertainty quantification in ensemble methods that rely on subsampled kernel evaluations, and the methodology may extend to random kernels and Studentized incomplete U-statistics.
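To make the object of study concrete, here is a minimal Python sketch of an incomplete U-statistic with Bernoulli sampling. It assumes the commonly used construction, which this summary does not spell out: each size-$m$ subset of the sample is kept independently with probability $p = N / \binom{n}{m}$, and the kept kernel values are averaged over the realized number of kept subsets. The function name, the variance kernel, and the normalization are illustrative assumptions, not the paper's notation. Enumerating every subset is done only for clarity and gives up the computational savings that motivate the method.

```python
import itertools
import math
import random


def incomplete_u_stat_bernoulli(data, kernel, m, budget, rng):
    """Illustrative incomplete U-statistic with Bernoulli sampling.

    Assumed construction: every size-m subset is kept independently with
    probability p = budget / C(n, m), and the kept kernel evaluations are
    averaged over the realized number of kept subsets.
    """
    n = len(data)
    p = min(1.0, budget / math.comb(n, m))
    total, kept = 0.0, 0
    for subset in itertools.combinations(data, m):
        if rng.random() < p:  # Bernoulli(p) selection indicator
            total += kernel(*subset)
            kept += 1
    if kept == 0:
        raise ValueError("no subsets were sampled; increase the budget")
    return total / kept, kept


def variance_kernel(x, y):
    # Degree-2 kernel whose complete U-statistic is the unbiased sample variance.
    return 0.5 * (x - y) ** 2


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(60)]
estimate, n_kept = incomplete_u_stat_bernoulli(
    sample, variance_kernel, m=2, budget=300, rng=rng
)
```

Setting `budget` to $\binom{n}{m}$ keeps every subset, so the sketch reduces to the complete U-statistic; smaller budgets trade accuracy for fewer kernel evaluations, which is the trade-off the three regimes describe.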

Abstract

There has been a resurgence of interest in incomplete U-statistics that only sum over a subset of kernel evaluations, due to their computational efficiency and asymptotic normality, which can be leveraged to quantify the uncertainty of ensemble predictions in machine learning. In this paper, we study the weak convergence to normality of one such construction, the incomplete U-statistic with Bernoulli sampling, under three different regimes on the relative sizes of the raw sample and the computational budget. Under minimalistic moment assumptions, we establish accompanying Berry-Esseen bounds with the natural rates that characterize the accuracy of these normal approximations. The key ingredients in our proofs include a variable censoring technique and a methodology for establishing Berry-Esseen bounds for the so-called Studentized nonlinear statistics recently formalized in the Stein's method literature, as well as an exponential lower tail bound for non-negative kernel U-statistics.

Paper Structure (41 sections, 13 theorems, 303 equations)

Key Result

Theorem 1.1

Fix $m \geq 2$. Suppose both $n$ and $N$ tend to $\infty$ and the assumptions in (mean0_assumption) hold.

Theorems & Definitions (16)

  • Theorem 1.1: Normal convergence of $U_{n, N}'$ for a given degree $m$
  • Theorem 3.1: Berry-Esseen theorem for $U_{n, N}'$, "$N \gg n$"
  • Theorem 3.2: Berry-Esseen theorem for $U_{n, N}'$, "$N \ll n^d$"
  • Theorem 3.3: Berry-Esseen theorem for $U_{n, N}'$, "$N \asymp n$"
  • Lemma 4.1: B-E bound for nonlinear statistics
  • Lemma 4.2: Estimates regarding $\Delta_1$ and $\Delta_2$
  • Lemma 5.1: Exponential lower tail bound for U-statistics with non-negative kernels [leungshao2024nonuniform]
  • Lemma 5.2: B-E bound for $|P(T_{SN} \leq \mathfrak{z}_{\mathcal{X}}) - \mathbb{E}[\Phi(\mathfrak{z}_{\mathcal{X}})]|$
  • Lemma 5.5: Bounds on $T(D_1)$
  • Lemma 5.6: Bounds on $T(D_2)$
  • ...and 6 more