The Minimax Risk in Testing Uniformity over Large Alphabets under Missing-Ball Alternatives

Alon Kipnis

TL;DR

The paper analyzes the minimax risk of testing uniformity for Poisson-distributed counts across a large alphabet under $\ell_p$ ($p \leq 2$) departures from uniformity. It develops a Bayesian reduction to a structured subset of alternatives and proves uniform asymptotic normality for linear histogram tests, identifying a unique least favorable prior $\pi^*$ that yields a minimax test $\psi^*$. The main results provide a precise asymptotic risk formula involving $u_{\boldsymbol{\epsilon},n,N,p}$ and show that the minimax test outperforms chi-squared and collision-based tests except in certain regimes; they also relate the Poisson minimax risk to the multinomial setting via de-Poissonization. Empirical results corroborate the theoretical predictions, and the framework opens avenues for extensions to non-ball shapes and sparse alternatives, with practical impact on high-dimensional goodness-of-fit testing.

Abstract

We study the problem of testing the goodness of fit of categorical count data to a Poisson distribution uniform over the categories, against a class of alternatives defined by excluding an $\ell_p$ ball, $p \leq 2$, of radius $\epsilon$ around the uniform rate sequence. We characterize the minimax risk for this problem as the expected number of samples $n$ and the number of categories $N$ go to infinity. Our result enables constant-factor comparisons among the many estimators previously proposed for this problem, rather than comparisons only at the level of convergence rates or scaling orders of sample complexity. The minimax test relies exclusively on collisions in the small sample limit, but behaves like the chi-squared test otherwise. Empirical studies across a range of parameters show that the asymptotic risk estimate is accurate in finite samples, and that the minimax test outperforms both the chi-squared test and a test based on collisions under the least favorable alternative. Our analysis involves a reduction to a structured subset of alternatives, establishing uniform asymptotic normality for a family of linear test statistics, and solving an optimization problem over $N$-dimensional sequences akin to classical results from signal detection in Gaussian white noise. Finally, we discuss the connection to the fixed-sample-size multinomial model, arguing that the Poisson minimax risk derived here also characterizes the minimax risk of the multinomial problem.
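To make the statistics named in the abstract concrete, the following is a minimal Python sketch (standard library only) of the histogram ordinates, a collision count, and a chi-squared statistic computed from a vector of per-category counts. The function names, the Pearson-style normalization by $\lambda_0 = n/N$, and the pair-counting form of the collision statistic are illustrative assumptions, not the paper's definitions. The sketch also shows that both statistics can be written as linear functions of the histogram ordinates, which is the family of tests the analysis considers.

```python
from collections import Counter
from math import comb


def histogram_ordinates(counts):
    """Return [X_0, ..., X_max], where X_m is the number of categories seen exactly m times."""
    tally = Counter(counts)
    return [tally.get(m, 0) for m in range(max(counts, default=-1) + 1)]


def collision_stat(counts):
    """Number of colliding pairs: sum over categories of C(count, 2). Assumed form."""
    return sum(comb(c, 2) for c in counts)


def chi_squared_stat(counts, n):
    """Pearson-style statistic against the uniform rate lam0 = n / N. Assumed normalization."""
    lam0 = n / len(counts)
    return sum((c - lam0) ** 2 for c in counts) / lam0


def linear_histogram_stat(counts, weight):
    """Linear histogram statistic sum_m weight(m) * X_m for a weight function of m."""
    return sum(weight(m) * x for m, x in enumerate(histogram_ordinates(counts)))
```

With `weight = lambda m: m * (m - 1) // 2` the linear statistic reproduces `collision_stat`, and with `weight = lambda m: (m - lam0) ** 2 / lam0` it reproduces `chi_squared_stat`. For the counts `[0, 1, 2, 2, 3]`, the histogram ordinates are `[1, 1, 2, 1]` and the collision count is 5.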

Paper Structure

This paper contains 36 sections, 16 theorems, 206 equations, and 5 figures.

Key Result

Corollary 1

Consider a sequence of multivariate Poisson models \ref{eq:hyp_Q} indexed by $n$ and $N$, where $N$ and $n$ go to infinity. Let $\xi = \xi_{n,N}$ satisfy \ref{eq:max_test_cond}. Then [...]. Additionally, let $r=r_{n,N}$ satisfy \ref{eq:L2_sep_condition}. Then [...].

Figures (5)

  • Figure 1: Conceptual sketch of the set of alternatives $V_{\epsilon}$ (shaded red) in $N=2$ dimensions for some $p \in (1,2]$. The least favorable rate sequences in $V_\epsilon$ are typical realizations of a prior supported on the points at the boundary of the $\ell_p$ ball around the uniform rate sequence $U = (1/N,\ldots,1/N)$ closest to the center (indicated by 4 red dots). $\mu^*=\epsilon N^{-1/p}$ is the perturbation defining the least favorable prior.
  • Figure 2: The weights of the minimax test $w^*$ of \ref{eq:optimal_test} and their relative expected contributions to departures from the null in the test statistic; here $n=10{,}000$, $p=1$, $\epsilon=0.1$. Left: Bars proportional to the coordinates of $w^*$, where $w^*_0$ is the weight for missing categories, $w^*_1$ for singletons, $w^*_2$ for exclusive collisions, and so on. Center: The expected shift in the mean of the histogram ordinates under the least favorable prior $\pi^*$ defined in \ref{eq:pi_star_l1}. Right: The normalized product $w_m^* \Delta_m(\pi^*)$, indicating the expected difference each histogram ordinate contributes to the minimax test statistic under $\pi^*$ relative to the null. Different colors represent different values of $\lambda_0 = n/N$. As $\lambda_0 \to 0$, only the exclusive collision count $X_2$ contributes to the test statistic, as indicated by the blue bar in the right panel.
  • Figure 3: Empirical (solid) and theoretical (dashed) risk under the least favorable prior versus $\epsilon$ for several values of $\lambda_0 = n/N$ and $p=1$. The empirical risk in each configuration is the average error over $10{,}000$ Monte Carlo trials. In each trial, we used $n=10{,}000$ samples from the null and $n=10{,}000$ samples from the alternative to evaluate the Type-I and Type-II errors, respectively.
  • Figure 4: Risk of the chi-squared test \ref{eq:chisquared_def} and the collision-based test \ref{eq:collision_test}, normalized by the asymptotic minimax risk $R^*(V_{\epsilon})$, under the least favorable prior and $p=1$. Each curve shows the ratio to $R^*(V_{\epsilon})$; solid lines indicate empirical risk, and dashed lines indicate theoretical asymptotic risk. Top: $\lambda_0 = n/N = 1/10$. Bottom: $\lambda_0 = 3/4$. Empirical risks are computed as the average error over $10{,}000$ Monte Carlo trials. In each trial, $n=10{,}000$ samples are drawn from the null and $n=10{,}000$ samples from the alternative to estimate the Type-I and Type-II errors, respectively.
  • Figure 5: Empirical risk of the minimax test under the least favorable prior, together with the risks of the chi-squared and collision-based tests, with parameters scaled in $N$ to yield a constant minimax risk; dashed lines indicate the theoretical asymptotic risks. Top: $\lambda_0 \to 0$ as $N \to \infty$. Bottom: $\lambda_0 = 1$. In both panels, $p=1$. Theoretical risks are evaluated according to Theorem \ref{thm:matching} and Proposition \ref{prop:chisq_n_collision}; empirical risks are computed as in Fig. \ref{fig:comparisons}.
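
The captions above describe two ingredients that a short simulation can illustrate: rate sequences perturbed by $\mu^* = \epsilon N^{-1/p}$ around the uniform sequence, and a Monte Carlo estimate of a test's error. The Python sketch below (standard library only) is a hedged illustration, not the paper's procedure. It assumes that each coordinate is perturbed by $\pm\mu^*$ with independent, equally likely signs (the prior's actual definition is not reproduced in this summary), that risk means the sum of the Type-I and Type-II error rates, and that the test rejects when a statistic exceeds a threshold. All function names are hypothetical, and `sample_poisson` uses Knuth's multiplication method as a stand-in sampler suited to small rates. Since every coordinate deviates by $\mu^*$, the $\ell_p$ distance from uniform is $(N (\mu^*)^p)^{1/p} = \epsilon$, which the helper `lp_distance_from_uniform` can confirm numerically.

```python
import random
from math import exp


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's method; adequate for small lam only."""
    threshold = exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def perturbed_rates(N, eps, p, rng):
    """Rates 1/N +/- mu with mu = eps * N**(-1/p); independent fair signs are an assumption."""
    mu = eps * N ** (-1.0 / p)
    if mu > 1.0 / N:
        raise ValueError("eps too large: perturbed rates would be negative")
    return [1.0 / N + rng.choice((-1.0, 1.0)) * mu for _ in range(N)]


def lp_distance_from_uniform(rates, p):
    """l_p distance between a rate sequence and the uniform sequence (1/N, ..., 1/N)."""
    N = len(rates)
    return sum(abs(q - 1.0 / N) ** p for q in rates) ** (1.0 / p)


def sample_counts(n, rates, rng):
    """Independent Poisson counts with means n * rate, one per category."""
    return [sample_poisson(n * q, rng) for q in rates]


def empirical_risk(stat, threshold, n, N, eps, p, trials, rng):
    """Estimate Type-I + Type-II error of the test 'reject if stat(counts) > threshold'."""
    uniform = [1.0 / N] * N
    type1 = type2 = 0
    for _ in range(trials):
        # Type-I error: rejecting when the counts come from the uniform null.
        if stat(sample_counts(n, uniform, rng)) > threshold:
            type1 += 1
        # Type-II error: accepting when the counts come from a perturbed alternative.
        alt = perturbed_rates(N, eps, p, rng)
        if stat(sample_counts(n, alt, rng)) <= threshold:
            type2 += 1
    return (type1 + type2) / trials
```

Because the signs are drawn independently, the perturbed rates need not sum exactly to one; in a Poisson model this only changes the expected total count slightly. A test that always rejects or never rejects has estimated risk exactly 1 under this definition, which gives a simple sanity check on the simulation loop.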

Theorems & Definitions (16)

  • Corollary 1
  • Lemma 2
  • Proposition 3
  • Proposition 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Corollary 8
  • Proposition 9
  • Lemma 10
  • ...and 6 more