Advanced Tutorial: Label-Efficient Two-Sample Tests

Weizhi Li; Visar Berisha; Gautam Dasarathy

TL;DR

This work tackles label-costly two-sample testing by adapting active-learning ideas to hypothesis testing, ensuring valid $p$-values and high power under limited labeling. It introduces a batch three-stage framework that learns a class-posterior predictor, uses a bimodal query to selectively label informative samples, and applies the Friedman–Rafsky test on labeled data; it also proves Type I error control and characterizes asymptotic power via mutual information. Additionally, it presents a sequential framework that yields an anytime-valid $p$-value and analyzes its asymptotic and finite-sample properties, including quantifying approximation error with $D_{\text{KL}^2}$ and irreducible error via information-density variance. The results demonstrate substantial label-efficiency gains with practical guidance for validation in digital health, cancer biomarker studies, wildlife monitoring, and other data-scarce, label-expensive settings.
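The query step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bimodal query selects the unlabeled points that the learned class-posterior predictor assigns most confidently to each of the two classes, and that the label budget is split evenly between the two modes. The function name `bimodal_query` and its parameters are hypothetical.

```python
import numpy as np


def bimodal_query(posterior_class1, budget):
    """Pick indices to label: the points most confidently assigned to each class.

    posterior_class1: estimated P(label = 1 | x) for every unlabeled point.
    budget: total number of labels to request (assumed split evenly).
    """
    posterior_class1 = np.asarray(posterior_class1, dtype=float)
    n = len(posterior_class1)
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of points")

    order = np.argsort(posterior_class1)       # ascending in P(label = 1 | x)
    n_low = budget // 2                        # points with highest P(label = 0 | x)
    n_high = budget - n_low                    # points with highest P(label = 1 | x)
    return np.concatenate([order[:n_low], order[n - n_high:]])
```

The returned indices would then be sent to the labeling oracle, and only those labeled points would enter the final test.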

Abstract

Hypothesis testing is a statistical inference approach used to determine whether data support a specific hypothesis. An important instance is the two-sample test, which evaluates whether two sets of data points are drawn from the same distribution. The test is widely used, for example by clinical researchers comparing the effectiveness of treatments. This tutorial explores two-sample testing in a setting where an analyst has many features from two samples, but determining the sample membership (or labels) of these features is costly. In machine learning, a similar scenario is studied in active learning. This tutorial extends active-learning concepts to two-sample testing in this label-costly setting while maintaining statistical validity and high testing power. It also discusses practical applications of these label-efficient two-sample tests.
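For readers unfamiliar with the Friedman–Rafsky test named in the TL;DR, the sketch below shows one common way to compute it: build a Euclidean minimum spanning tree over the pooled points and count the edges that join points from different samples. Few cross-sample edges suggest the two distributions differ. The permutation-based $p$-value is an assumption made here for simplicity (the tutorial's own calibration may differ), and the function names are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_edges(points):
    """Endpoints of the Euclidean minimum spanning tree over the pooled points."""
    dist = squareform(pdist(points))            # dense pairwise distance matrix
    # Note: zero distances (exact duplicate points) are treated as missing edges.
    mst = minimum_spanning_tree(dist).tocoo()
    return mst.row, mst.col


def cross_edge_count(rows, cols, labels):
    """Number of tree edges whose two endpoints carry different labels."""
    return int(np.sum(labels[rows] != labels[cols]))


def fr_permutation_test(x, y, n_perm=999, seed=0):
    """Return the cross-edge count and a permutation p-value (small count = reject)."""
    rng = np.random.default_rng(seed)
    points = np.vstack([x, y])
    labels = np.concatenate([np.zeros(len(x), dtype=int), np.ones(len(y), dtype=int)])

    rows, cols = mst_edges(points)              # the tree does not depend on labels
    observed = cross_edge_count(rows, cols, labels)

    # Null distribution: shuffle labels over the fixed tree.
    permuted = np.array([
        cross_edge_count(rows, cols, rng.permutation(labels)) for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(permuted <= observed)) / (n_perm + 1)
    return observed, p_value
```

Because the tree is built from features alone, it is computed once and reused across permutations; only the label assignment changes.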

Paper Structure (23 sections, 9 theorems, 24 equations, 2 figures, 2 algorithms)

Introduction
Review of the Two-Sample Testing
A Traditional Two-Sample Testing Problem
Desired Properties for the Traditional Two-Sample Testing
Classical Two-Sample Tests
Nonparametric Two-Sample Testing
Sequential Nonparametric Two-Sample Testing
The Label-Efficient Two-Sample Testing Problem
Bimodal query
A batch label-efficient two-sample test
A Three-Stage Two-Sample Testing Framework
Consistent Bimodal Query Minimizes the FR Statistic $W_n$
Type I Error of the Three-Stage Framework
A Sequential Label-Efficient Two-sample Test
A Sequential Label-Efficient Framework
...and 8 more sections

Key Result

Theorem 1

(Henze and Penrose, 1999) Given $\mathcal{X}=\left\{\mathbf{x}_1,\ldots,\mathbf{x}_{n_0}\right\}$ and $\mathcal{Y}=\left\{\mathbf{y}_1,\ldots,\mathbf{y}_{n_1}\right\}$, which are i.i.d. realizations of $\mathbf{X}\sim p_{\mathbf{X}}\left(\mathbf{x}\right)$ and $\mathbf{Y}\sim p_{\mathbf{Y}}\left(\mathbf{y}\right)$ ...

Figures (2)

Figure 1: The sequential label-efficient framework
Figure 2: An example of the proposed framework and the baseline.

Theorems & Definitions (13)

Theorem 1
Theorem 2
Theorem 3
Definition 1
Proposition 1
Theorem 4
Remark 1
Theorem 5
Theorem 6
Theorem 7
...and 3 more
