Advanced Tutorial: Label-Efficient Two-Sample Tests
Weizhi Li, Visar Berisha, Gautam Dasarathy
TL;DR
This work addresses two-sample testing when labels are costly by adapting active-learning ideas to hypothesis testing, keeping $p$-values valid and power high under a limited labeling budget. It introduces a batch three-stage framework that learns a class-posterior predictor, uses a bimodal query to selectively label informative samples, and applies the Friedman–Rafsky test to the labeled data; it proves Type I error control and characterizes asymptotic power via mutual information. It also presents a sequential framework that yields an anytime-valid $p$-value and analyzes its asymptotic and finite-sample properties, quantifying approximation error with $D_{\text{KL}^2}$ and irreducible error via information-density variance. The results demonstrate substantial label-efficiency gains and come with practical guidance for validation in digital health, cancer biomarker studies, wildlife monitoring, and other data-scarce, label-expensive settings.
Abstract
Hypothesis testing is a statistical inference approach for determining whether data support a specific hypothesis. An important instance is the two-sample test, which evaluates whether two sets of data points come from the same distribution. Two-sample tests are widely used, for example by clinical researchers comparing the effectiveness of treatments. This tutorial explores two-sample testing in a setting where an analyst has many features from two samples, but determining the sample membership (or label) of each feature is costly. Machine learning studies a similar scenario under the name active learning. This tutorial extends active learning concepts to two-sample testing in this \textit{label-costly} setting while maintaining statistical validity and high testing power. It also discusses practical applications of these label-efficient two-sample tests.
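
To make the two-sample test concrete, the following is a minimal sketch of the Friedman–Rafsky test named in the TL;DR, applied to data that are already labeled. It builds a Euclidean minimum spanning tree over the pooled features and counts the edges that join points with different labels; an unusually small count is evidence that the two samples come from different distributions. The function names (`mst_edges`, `cross_edge_count`, `friedman_rafsky_pvalue`), the permutation null, and the default of 999 permutations are illustrative assumptions and are not taken from the tutorial. The sketch does not cover the class-posterior predictor or the bimodal query.

```python
import math
import random


def mst_edges(points):
    """Edges (i, j) of a Euclidean minimum spanning tree, via Prim's algorithm."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # distance from each point to the current tree
    best_from = [-1] * n         # tree vertex that achieves that distance
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the point closest to the tree built so far.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edge_count(edges, labels):
    """Number of tree edges whose endpoints carry different labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def friedman_rafsky_pvalue(points, labels, n_perm=999, seed=0):
    """Permutation p-value; small cross-edge counts count against the null.

    Assumption: a permutation null is used here for simplicity.
    """
    rng = random.Random(seed)
    edges = mst_edges(points)    # the tree depends on features only
    observed = cross_edge_count(edges, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cross_edge_count(edges, shuffled) <= observed:
            hits += 1
    # The +1 terms keep the p-value strictly positive and valid.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    sample0 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
    sample1 = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(30)]
    stat, p = friedman_rafsky_pvalue(sample0 + sample1, [0] * 30 + [1] * 30)
    print(f"cross-label edges: {stat}, permutation p-value: {p:.3f}")
```

Because the tree is built from the features alone, it is computed once and reused across permutations; only the labels are shuffled. In the label-costly setting, `labels` would hold only the labels the analyst has chosen to query.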
