E-Valuating Classifier Two-Sample Tests

Teodora Pandeva, Tim Bakker, Christian A. Naesseth, Patrick Forré

TL;DR

This work addresses robust two-sample testing in high-dimensional and sequential settings by introducing E-C2ST, a deep classifier-based test built on E-variables, i.e. non-negative statistics $E$ satisfying $\mathbb{E}_P[E] \le 1$ for every distribution $P$ in the null hypothesis. By combining ideas from split likelihood ratio tests and predictive conditional independence testing, the authors derive both batch-wise and sequential E-processes that offer anytime-valid type I error control while improving power through multiple data splits. They establish theoretical guarantees for type I error control and consistency, and propose a practical, bounded E-variable construction with a tunable mixing parameter $\lambda$ to stabilize performance. Empirically, E-C2ST outperforms standard p-value-based C2ST baselines across synthetic and real datasets (Blob, KDEF, MNIST) by leveraging information from all batches, and the experiments characterize the effect of the batch size and of the initial value of $\lambda$. The approach thus provides a principled, scalable framework for sequential two-sample testing in complex data domains, with potential extensions in online learning and active data selection.

Abstract

We introduce a powerful deep classifier two-sample test for high-dimensional data based on E-values, called E-value Classifier Two-Sample Test (E-C2ST). Our test combines ideas from existing work on split likelihood ratio tests and predictive independence tests. The resulting E-values are suitable for anytime-valid sequential two-sample tests. This feature allows for more effective use of data in constructing test statistics. Through simulations and real data applications, we empirically demonstrate that E-C2ST achieves enhanced statistical power by partitioning datasets into multiple batches beyond the conventional two-split (training and testing) approach of standard classifier two-sample tests. This strategy increases the power of the test while keeping the type I error well below the desired significance level.
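
The batch-wise procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that each batch e-value is the likelihood of the sample labels under a classifier trained only on earlier batches, divided by the best label-marginal (Bernoulli) likelihood on the same batch, in the spirit of split likelihood ratio tests. The function names `batch_e_value` and `sequential_test` are hypothetical, and the classifier itself is left out; only its predicted probabilities are needed.

```python
import numpy as np


def batch_e_value(pred_prob_1, labels, eps=1e-12):
    """E-value for one held-out batch (assumed split-likelihood-ratio form).

    pred_prob_1: predicted probability that each point belongs to sample 1,
                 produced by a classifier fitted on *earlier* batches only.
    labels:      0/1 sample-membership labels of the batch.
    """
    p = np.clip(np.asarray(pred_prob_1, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=int)
    # Log-likelihood of the labels under the classifier (alternative).
    log_alt = np.sum(y * np.log(p) + (1 - y) * np.log(1.0 - p))
    # Log-likelihood under the null, maximised over the label marginal.
    pi = np.clip(y.mean(), eps, 1.0 - eps)
    log_null = np.sum(y * np.log(pi) + (1 - y) * np.log(1.0 - pi))
    return float(np.exp(log_alt - log_null))


def sequential_test(batch_e_values, alpha=0.05):
    """Multiply batch e-values and stop once the product reaches 1/alpha."""
    running = 1.0
    for step, e in enumerate(batch_e_values, start=1):
        running *= e
        if running >= 1.0 / alpha:
            return True, step, running
    return False, len(batch_e_values), running
```

In this sketch an uninformative classifier (all predictions equal to 0.5 on a balanced batch) yields an e-value of exactly one, so it neither adds nor removes evidence, while confident correct predictions push the running product towards the rejection threshold $1/\alpha$.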

Paper Structure (35 sections, 19 theorems, 83 equations, 10 figures, 4 tables, 1 algorithm)

Key Result

Lemma 2.2

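If $E^{(1)}$ is a conditional $\textsc{e}$-variable w.r.t. $\mathcal{H}_\mathsf{0}^{(1)} \subseteq \mathcal{P}(\mathcal{Y})^\mathcal{Z}$ and $E^{(2)}$ a conditional $\textsc{e}$-variable w.r.t. $\mathcal{H}_\mathsf{0}^{(2)} \subseteq \mathcal{P}(\mathcal{X})^{\mathcal{Y} \times \mathcal{Z}}$, then their product $E^{(2)}(x \mid y, z) \cdot E^{(1)}(y \mid z)$ is a conditional $\textsc{e}$-variable w.r.t. $\mathcal{H}_\mathsf{0}^{(2)} \otimes \mathcal{H}_\mathsf{0}^{(1)} \subseteq \mathcal{P}(\mathcal{X} \times \mathcal{Y})^{\mathcal{Z}}$, where we define the product hypothesis as $\mathcal{H}_\mathsf{0}^{(2)} \otimes \mathcal{H}_\mathsf{0}^{(1)} := \{ P^{(2)} \otimes P^{(1)} \mid P^{(2)} \in \mathcal{H}_\mathsf{0}^{(2)},\ P^{(1)} \in \mathcal{H}_\mathsf{0}^{(1)} \}$, with ...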
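
The reason a product of conditional e-variables is again a conditional e-variable is the tower property of conditional expectation. A brief sketch follows, assuming the product takes the form $E^{(2)}(x \mid y, z)\, E^{(1)}(y \mid z)$ and that every null kernel factorises as $P^{(2)}(dx \mid y, z)\, P^{(1)}(dy \mid z)$ with $P^{(i)} \in \mathcal{H}_\mathsf{0}^{(i)}$. The symbols $P^{(1)}$ and $P^{(2)}$ are notation assumed here for illustration.

```latex
\begin{aligned}
&\int\!\!\int E^{(2)}(x \mid y, z)\, E^{(1)}(y \mid z)\,
   P^{(2)}(dx \mid y, z)\, P^{(1)}(dy \mid z) \\
&\qquad = \int E^{(1)}(y \mid z)
   \underbrace{\left[ \int E^{(2)}(x \mid y, z)\, P^{(2)}(dx \mid y, z) \right]}_{\le\, 1}
   P^{(1)}(dy \mid z) \\
&\qquad \le \int E^{(1)}(y \mid z)\, P^{(1)}(dy \mid z) \;\le\; 1 .
\end{aligned}
```

The first inequality uses that $E^{(1)}$ is non-negative, so bounding the inner integral by one can only increase the outer integral.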

Figures (10)

  • Figure 1: Type I error and power experiments for the Blob dataset. E-C2ST reaches maximum power faster than the baselines while keeping the type I error strictly below the significance level.
  • Figure 2: Power analysis and type I error for the KDEF data. All methods show very good power performance. The baselines start with a higher power. However, E-C2ST reaches power one the fastest while keeping the type I error lower than the baselines. The dashed line corresponds to the significance level $\alpha=0.05$.
  • Figure 3: Power analysis for the Corrupted MNIST data for different proportions of corruption $p \in \{0, 0.5, 0.7, 1\}$. Compared to the baselines, E-C2ST shows the highest power. The dashed line represents the significance level.
  • Figure 4: Power experiments performed on the MNIST and KDEF datasets using different batch sizes (8, 16, 32, 64, 128). The lines indicate the estimated power. In general, we can conclude that smaller batch sizes (except very small batches) allow faster rejection of the null hypothesis in terms of number of samples, and larger batch sizes require fewer steps but more samples.
  • Figure 5: Power experiments performed on the MNIST and KDEF datasets for varying $\lambda \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ and a fixed batch size of 32 samples. The lines indicate the estimated power. The initial value of $\lambda$ had no significant impact on the test performance in the KDEF scenario, while in the MNIST case, higher $\lambda$ values increased the test performance. This effect is due to the early stages of testing, where lower initial $\lambda$ values and the suboptimal neural network performance lead to lower batch $\textsc{e}$-values.
  • ...and 5 more figures

Theorems & Definitions (36)

  • Definition 2.1: Conditional E-variable
  • Lemma 2.2: Products of conditional E-variables (based on GdHK20)
  • Lemma 2.3: Type I error control
  • Proposition 2.1: ramdas2022game, GdHK20
  • Remark 3.1: Intuition for the $m$-th conditional $\textsc{e}$-variable.
  • Corollary 3.2: Batch-wise anytime type I error control
  • Theorem 3.3
  • Theorem 3.4
  • Remark 4.1
  • Lemma 5.1
  • ...and 26 more
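
Type I error control for e-variables typically rests on Markov's inequality, and its anytime version on Ville's inequality. Both standard bounds are sketched below, assuming a non-negative $E$ with $\mathbb{E}_P[E] \le 1$ under the null and an e-process $(E_t)_{t \ge 1}$ that is a non-negative supermartingale started at one. Whether Lemma 2.3 and Corollary 3.2 are stated in exactly this form is an assumption.

```latex
\begin{aligned}
P\!\left( E \ge \tfrac{1}{\alpha} \right)
  &\le \alpha\, \mathbb{E}_P[E] \le \alpha
  && \text{(Markov's inequality)} \\
P\!\left( \exists\, t \ge 1 : E_t \ge \tfrac{1}{\alpha} \right)
  &\le \alpha
  && \text{(Ville's inequality)}
\end{aligned}
```

Under these assumptions, rejecting as soon as the running product of batch e-values reaches $1/\alpha$ keeps the type I error at most $\alpha$, regardless of when monitoring stops.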