Classifier-Based Nonparametric Sequential Hypothesis Testing

Chia-Yu Hsu; Shubhanshu Shekhar

Abstract

We consider the problem of constructing sequential power-one tests where the null and alternative classes are specified indirectly through historical or offline data. More specifically, given an offline dataset consisting of observations from $L+1$ distributions $\{P_0, P_1, \ldots, P_L\}$, and a new unlabeled data stream $\{X_t: t \geq 1\} \overset{\text{i.i.d.}}{\sim} P_\theta$, the goal is to test the null $H_0: \theta = 0$ against the alternative $H_1: \theta \in [L] := \{1,\ldots,L\}$. Our main methodological contribution is a general approach for designing a level-$\alpha$ power-one test for this problem using a multi-class classifier trained on the given offline dataset. Working under a mild "separability" condition on the distributions and the trained classifier, we obtain an upper bound on the expected stopping time of our proposed level-$\alpha$ test, and then show that this bound cannot be improved in general. In addition to rejecting the null, we show that our procedure can also identify the true underlying distribution almost surely. We then establish a sufficient condition that ensures the required separability of the classifier, and provide converse results on the role of the offline dataset size and the classifier family among classifier-based tests that satisfy the level-$\alpha$ power-one criterion. Finally, we extend our analysis to training-and-testing distribution mismatch and illustrate an application to sequential change detection. Empirical results on both synthetic and real data support our theoretical results.
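To make the phrase "level-$\alpha$ power-one test" concrete, the following is a minimal sketch of a generic wealth-based sequential test, not the stopping rule proposed in the paper. It assumes a toy setting with $L = 1$, $P_0 = N(0,1)$ and $P_1 = N(1,1)$, and a hypothetical function `toy_posterior` that stands in for a trained classifier by returning the exact balanced-prior posterior. Under that assumption the ratio of the two posterior entries is the likelihood ratio, whose null expectation is one, so stopping when the running product reaches $1/\alpha$ controls the type-I error by Ville's inequality. All function names are illustrative.

```python
import math
import random


def toy_posterior(x):
    """Hypothetical stand-in for a trained classifier (an assumption):
    exact posterior over {P_0 = N(0,1), P_1 = N(1,1)} with a balanced prior."""
    log_ratio = x - 0.5  # log p_1(x) - log p_0(x) for these two Gaussians
    p1 = 1.0 / (1.0 + math.exp(-log_ratio))
    return [1.0 - p1, p1]


def e_value(x):
    """Per-observation betting factor: posterior odds of label 1 vs. label 0."""
    p = toy_posterior(x)
    return p[1] / p[0]


def sequential_test(e_values, alpha):
    """Accumulate log-wealth and stop the first time wealth >= 1/alpha.

    Returns the 1-indexed stopping time, or None if the stream ends first.
    """
    log_wealth = 0.0
    log_threshold = math.log(1.0 / alpha)
    for t, e in enumerate(e_values, start=1):
        log_wealth += math.log(e)
        if log_wealth >= log_threshold:
            return t
    return None


# Stream drawn from the alternative P_1: the test should stop in finite time.
rng = random.Random(0)
stream = (rng.gauss(1.0, 1.0) for _ in range(10_000))
tau = sequential_test(map(e_value, stream), alpha=0.05)
print("stopped at t =", tau)
```

With a real classifier the posterior odds are only an estimate of the likelihood ratio, so the null expectation need not equal one; handling that gap is exactly what a carefully constructed e-process must do, and this sketch does not attempt it.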

Paper Structure (46 sections, 14 theorems, 97 equations, 6 figures, 5 tables)

Introduction
Organization
Problem Formulation
Main Results
The proposed stopping rule and analysis
Proposed stopping rule
Theoretical analysis of the proposed test
Role of the offline training phase
Empirical attainability and training sample requirements
Empirical attainability
Necessary training sample requirement
Minimax lower bound
Extensions
Training-testing mismatch
Sequential change detection
...and 31 more sections

Key Result

Theorem 3.2

Our proposed stopping rule $\tau$ satisfies the following: under a given separable classifier $g\in\mathcal{G}$ for a tuple $\mathbf{P}\in\mathcal{P}_{\text{sep}}$, where $\Delta_{\theta} := \max_{m \in \mathcal{L}} \boldsymbol{p}_\theta[m] - \boldsymbol{p}_{\theta}[0] = \boldsymbol{p}_\theta[\theta] - \boldsymbol{p}_{\theta}[0]$.
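The gap $\Delta_\theta$ can be illustrated with a small worked example. This excerpt does not define $\boldsymbol{p}_\theta$, so the sketch below assumes it is a length-$(L+1)$ score vector indexed by the labels $0, 1, \ldots, L$, and that $\mathcal{L} = \{1, \ldots, L\}$; the helper name `separation_gap` and the numbers are hypothetical.

```python
def separation_gap(p):
    """Return (gap, best_label), where gap = max_{m in 1..L} p[m] - p[0].

    `p` is assumed to hold one score per label 0..L (an assumption, since
    the excerpt does not define the vector p_theta).
    """
    if len(p) < 2:
        raise ValueError("need a score for label 0 and at least one alternative")
    # Index of the largest score among the alternative labels 1..L.
    best_label = max(range(1, len(p)), key=lambda m: p[m])
    return p[best_label] - p[0], best_label


# Hypothetical score vector for theta = 2 with L = 2.
p_theta = [0.25, 0.125, 0.625]
gap, label = separation_gap(p_theta)
print(gap, label)
```

In this example the maximizing label coincides with $\theta = 2$, matching the second equality in the definition of $\Delta_\theta$, and the gap is $0.625 - 0.25 = 0.375$.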

Figures (6)

Figure 1: Average expected stopping time vs. desired level-$\alpha$ values under $\theta=2$: In both experiments, we consider $10$ $\alpha$-levels, denoted by $\alpha[1],\ldots,\alpha[10]$. For each $\alpha[i]$, we run $300$ independent trials, where each trial corresponds to one run of the sequential test at level $\alpha[i]$. The observed average stopping time (over $300$ trials) for each $\alpha[i]$ is represented by the blue dots. The curve labeled "Quadratic Fit" is the best match to the empirical results obtained by scaling the constant term of the theoretical bound on the expected stopping time. Details of the quadratic fit are described in Remark \ref{remark:quadratic_fit}.
Figure 2: The empirical ratio of $\hat{j}_\tau$ vs. $\alpha$ for Case 1 under $\theta=2$. We use the same $10$ $\alpha$-values as in the earlier average stopping time simulation and run $300$ trials at each $\alpha$-value. The blue and orange dots depict $\{(\alpha[i],\texttt{ratio[1]}(\alpha[i]))\}_{i=1}^{10}$ and $\{(\alpha[i],\texttt{ratio[2]}(\alpha[i]))\}_{i=1}^{10}$, respectively. Note that we do not plot $\texttt{ratio[0]}(\alpha[i])$ since it is zero for every $i\in\{1,\ldots,10\}$.
Figure 3: We visualize the distribution shift in this figure. The orange and blue points represent training samples drawn from $P_0$ and $P_1$, respectively, while the red and green rectangles represent testing samples drawn from the mean-shifted distributions $\tilde{P}_0$ and $\tilde{P}_1$. The purple curve shows the decision boundary of the MLP classifier trained on samples drawn from $(P_0, P_1)$.
Figure 4: Average stopping time comparison, where both plots use the MLP classifier trained on the non-shifted tuple $\mathbf{P}$. We plot these two figures using the same $\alpha$ values and quadratic fitting method as in Figure \ref{fig:exp1}. However, the quadratic fit in Figure \ref{fig:exp2_toy_stop_time} represents the theoretical scaling trend derived from Table \ref{tab:exp2_toy_training}, whereas its counterpart in Figure \ref{fig:exp2_toy_shifted_stop_time} is based on Table \ref{tab:exp2_toy_testing}.
Figure 5: Average expected delay vs. $\alpha$. We consider $\alpha$ values ranging from $10^{-4}$ to $10^{-3}$, using NumPy's \texttt{geomspace} to generate $50$ logarithmically spaced levels. For each $\alpha$, we run $500$ independent trials and plot the average delay (blue dots). The orange dashed curve shows the corresponding quadratic fit, computed in the same manner as in Section \ref{sec:verify_thm_sht}, with the expected stopping time replaced by the expected detection delay. Note that $T_c$ in the figure corresponds to the change point $T$ used in this paper.
...and 1 more figures
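The grid of levels described in the Figure 5 caption can be reproduced as follows. This is a small sketch based only on the endpoints and count stated in the caption ($10^{-4}$ to $10^{-3}$, $50$ levels); the variable name `alphas` is an assumption, and the experiment code itself is not shown in this excerpt.

```python
import numpy as np

# 50 logarithmically spaced levels from 1e-4 to 1e-3, endpoints included.
alphas = np.geomspace(1e-4, 1e-3, num=50)

# Consecutive levels differ by a constant multiplicative factor 10**(1/49).
ratios = alphas[1:] / alphas[:-1]
print(alphas[0], alphas[-1], ratios[0])
```

A geometric grid spaces the levels evenly in $\log(1/\alpha)$, which is the natural scale on which stopping times and detection delays of such tests are usually plotted.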

Theorems & Definitions (27)

Remark 2.1: Permutation
Remark 3.1: Practical implementation of our e-process
Theorem 3.2
Proposition 3.3
Corollary 3.4
Remark 3.5: Asymptotic optimality
Remark 3.6: Training-testing mismatch
Remark 3.7: Application to the sequential change-point detection problem
Remark 3.8
Proposition 3.9: Empirical attainability (Informal)
...and 17 more
