E-values for k-Sample Tests With Exponential Families

Yunda Hao; Peter Grünwald; Tyron Lardy; Long Long; Reuben Adams

Abstract

We develop and compare e-variables for testing whether $k$ samples of data are drawn from the same distribution, the alternative being that they come from different elements of an exponential family. We consider the GRO (growth-rate optimal) e-variables for (1) a 'small' null inside the same exponential family, and (2) a 'large' nonparametric null, as well as (3) an e-variable arrived at by conditioning on the sum of the sufficient statistics. (2) and (3) are efficiently computable, and extend ideas from Turner et al. [2021] and Wald [1947], respectively, from Bernoulli to general exponential families. We provide theoretical and simulation-based comparisons of these e-variables in terms of their logarithmic growth rate, and find that for small effects all four e-variables behave surprisingly similarly; for the Gaussian location and Poisson families, e-variables (1) and (3) coincide; for Bernoulli, (1) and (2) coincide; but in general, whether (2) or (3) grows faster under the alternative is family-dependent. We furthermore discuss algorithms for numerically approximating (1).
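To make the notion of an e-variable for the 'large' i.i.d. null concrete, here is a minimal Python sketch for the Bernoulli case with one observation per group. It assumes the product form $\prod_i p_{\mu_i}(x_i) / \big(\tfrac{1}{k}\sum_j p_{\mu_j}(x_i)\big)$, which generalizes the Turner et al. [2021] construction; this page does not display the formula, so treat the exact form as our assumption. The function names `bernoulli_pmf`, `s_iid`, and `expected_value_under_null` are illustrative, not taken from the paper. The check enumerates all outcomes and confirms that the expectation under an i.i.d. null does not exceed 1, which is the defining property of an e-variable.

```python
import itertools
import math


def bernoulli_pmf(x, mu):
    """Probability of outcome x in {0, 1} under a Bernoulli with mean mu."""
    return mu if x == 1 else 1.0 - mu


def s_iid(xs, mus, pmf=bernoulli_pmf):
    """Assumed product-form e-variable: one observation xs[i] per group i.

    Each factor is the alternative density of group i divided by the
    uniform mixture of all k alternative densities at the same point.
    """
    k = len(mus)
    value = 1.0
    for x, mu in zip(xs, mus):
        mixture = sum(pmf(x, m) for m in mus) / k
        value *= pmf(x, mu) / mixture
    return value


def expected_value_under_null(mus, null_mu):
    """Exact expectation of s_iid when all groups are i.i.d. Bernoulli(null_mu)."""
    total = 0.0
    for xs in itertools.product([0, 1], repeat=len(mus)):
        prob = math.prod(bernoulli_pmf(x, null_mu) for x in xs)
        total += prob * s_iid(xs, mus)
    return total
```

With alternative means `(0.3, 0.7)`, the expectation equals 1 when the null mean is 0.5 and is roughly 0.94 when the null mean is 0.2. The bound by 1 follows from the arithmetic-geometric mean inequality applied to the per-group expectations, whose average is exactly 1.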

Paper Structure (36 sections, 10 theorems, 85 equations, 5 figures, 3 tables)

Introduction
Results
Method: Restriction to Single Blocks and Simple Alternatives
Related Work and Practical Relevance
Contents
Formal Setting
The GRO E-variable for General H_0
The Four Types of E-variables
The GRO E-variable for H_0(M) and the pseudo e-variable
The GRO E-variable for H_0(iid)
The Conditional E-variable Scond
Growth Rate Comparison of Our E-variables
Growth Rate Comparison for Specific Exponential Families
Simulations to Approximate the RIPr
Approximating the RIPr via Li's Algorithm
...and 21 more sections

Key Result

Lemma 1

Let ${\cal P}$ be a set of probability distributions on ${\cal X}^k$ and let $\textsc{conv}({\cal P})$ be its convex hull. Then there exists a sub-probability measure $P^*_0$ with density $p^*_0$, called the reverse information projection (RIPr) of $P_{\bm \mu}$ onto $\textsc{conv}({\cal P})$.
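For orientation, the property that usually characterizes the RIPr can be written as follows. This is our paraphrase of the standard formulation from the literature on reverse information projections and growth-rate optimal e-variables; the excerpt above does not display it, and the notation $D(\cdot \,\|\, \cdot)$ for the Kullback-Leibler divergence and $S_{\textsc{gro}}$ for the resulting e-variable is assumed here. The second identity holds when the infimum is finite.

```latex
D(P_{\bm \mu} \,\|\, P^*_0) \;=\; \inf_{P \in \textsc{conv}({\cal P})} D(P_{\bm \mu} \,\|\, P),
\qquad
S_{\textsc{gro}} \;=\; \frac{p_{\bm \mu}(X^k)}{p^*_0(X^k)} .
```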

Figures (5)

Figure 1: A comparison of ${S}_{\textsc{gro}(\textsc{iid})}$ and ${S}_{\textsc{cond}}$ for four exponential families. We evaluated the expected growth difference on a grid of $50 \times 50$ alternatives $(\mu_1,\mu_2)$, equally spaced in the standard parameterization (explaining the nonlinear scaling on the depicted mean-value parameterization). On the left are the corresponding heatmaps. On the right are diagonal 'slices' of these heatmaps: the red curve corresponds to the main diagonal (top left to bottom right), and the blue curve corresponds to the diagonal running from the second tick mark (10th discretization point) at the top left to the second tick mark at the bottom right. These slices are symmetric around 0, their value depending only on $\delta = \mid \mu_1 - \mu_2\mid /\sqrt{2} = \mid \mu_1 - \mu^*_0\mid \cdot \sqrt{2}$, where $\mu_0^* = (\mu_1 + \mu_2)/2$ and $\delta$ is as in Theorem \ref{Taylor-approximation}.
Figure 2: Exponential distribution. On the right, $n$ represents the number of iterations of Li's algorithm, starting at iteration 2.
Figure 3: Beta distribution with free $\beta$ and fixed $\alpha$. On the right, $n$ represents the number of iterations of Li's algorithm, starting at iteration 2.
Figure 4: Geometric distribution. On the right, $n$ represents the number of iterations of Li's algorithm, starting at iteration 3.
Figure 5: Gaussian distribution with free variance and fixed mean. On the right, $n$ represents the number of iterations of Li's algorithm, starting at iteration 3.
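Figures 2 to 5 count iterations of Li's algorithm, which approximates the RIPr by building a mixture one component at a time. The sketch below illustrates that greedy idea on a finite sample space. It is a simplification of ours, not the paper's implementation: the finite `candidates` list, the grid search over the mixing weight, and the names `kl`, `mix`, and `greedy_ripr` are all assumptions made for illustration, and every candidate is assumed to put positive mass wherever the target does.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) for two distributions on the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(q_old, q_new, alpha):
    """Convex combination (1 - alpha) * q_old + alpha * q_new."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(q_old, q_new)]


def greedy_ripr(p, candidates, iterations=20, grid=101):
    """Greedy mixture approximation of the projection of p onto conv(candidates).

    Iteration 1 picks the single candidate closest to p in KL divergence.
    Each later iteration mixes in the one candidate and weight (from a grid
    on [0, 1]) that reduce D(p || mixture) the most, if any do.
    Returns the final mixture and the KL divergence after each iteration.
    """
    current = min(candidates, key=lambda q: kl(p, q))
    history = [kl(p, current)]
    alphas = [i / (grid - 1) for i in range(grid)]
    for _ in range(iterations - 1):
        best, best_kl = current, history[-1]
        for q in candidates:
            for alpha in alphas:
                proposal = mix(current, q, alpha)
                divergence = kl(p, proposal)
                if divergence < best_kl:
                    best, best_kl = proposal, divergence
        current = best
        history.append(best_kl)
    return current, history
```

Because a proposal is accepted only when it lowers the divergence, the recorded sequence is non-increasing by construction. In the setting of the paper, `p` would play the role of the alternative and `candidates` the role of null distributions, which in practice form a continuous family rather than a finite list.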

Theorems & Definitions (25)

Definition 1
Lemma 1
Definition 2
Lemma 2
Definition 3
Proposition 1
Proposition 2
Theorem 1
Definition 4
Definition 5
...and 15 more
