
Classifier-Based Nonparametric Sequential Hypothesis Testing

Chia-Yu Hsu, Shubhanshu Shekhar

Abstract

We consider the problem of constructing sequential power-one tests where the null and alternative classes are specified indirectly through historical or offline data. More specifically, given an offline dataset consisting of observations from $L+1$ distributions $\{P_0, P_1, \ldots, P_L\}$, and a new unlabeled data stream $\{X_t: t \geq 1\} \overset{\text{i.i.d.}}{\sim} P_\theta$, the goal is to decide between the null $H_0: \theta = 0$ and the alternative $H_1: \theta \in [L] := \{1,\ldots,L\}$. Our main methodological contribution is a general approach for designing a level-$\alpha$ power-one test for this problem using a multi-class classifier trained on the given offline dataset. Working under a mild "separability" condition on the distributions and the trained classifier, we obtain an upper bound on the expected stopping time of our proposed level-$\alpha$ test, and then show that in general this cannot be improved. In addition to rejecting the null, we show that our procedure can also identify the true underlying distribution almost surely. We then establish a sufficient condition to ensure the required separability of the classifier, and provide some converse results to investigate the role of the size of the offline dataset and the family of classifiers among classifier-based tests that satisfy the level-$\alpha$ power-one criterion. Finally, we present an extension of our analysis for the training-and-testing distribution mismatch and illustrate an application to sequential change detection. Empirical results using both synthetic and real data provide support for our theoretical results.
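To make the setting in the abstract concrete, the following is a minimal sketch of one generic way to turn classifier outputs into a level-$\alpha$ sequential test. It is not the authors' e-process, whose exact form is not given here. It assumes (this is an assumption of the sketch) that under the null the classifier's expected soft output for class $0$ is at least as large as for any other class, so each wealth process below is a nonnegative supermartingale and Ville's inequality bounds the false-alarm probability by $\alpha$. The names `toy_classifier`, `run_sequential_test`, `MEANS`, and the fixed bet size are all hypothetical.

```python
import math
import random

# Hypothetical toy setup: P_0, P_1, P_2 are unit-variance Gaussians with these means.
MEANS = (0.0, 3.0, 6.0)


def toy_classifier(x):
    """Stand-in for a trained classifier: softmax of Gaussian log-likelihoods."""
    logits = [-0.5 * (x - mu) ** 2 for mu in MEANS]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def run_sequential_test(stream, classifier, num_alt, alpha, bet=0.5):
    """Generic betting-style test (an illustrative assumption, not the paper's rule).

    Keeps one wealth process per alternative m = 1..num_alt and rejects the null
    once the average wealth reaches 1/alpha. Returns (stopping_time, leading
    alternative), or (None, None) if the stream ends without a rejection.
    """
    wealth = [1.0] * num_alt
    threshold = 1.0 / alpha
    for t, x in enumerate(stream, start=1):
        probs = classifier(x)  # length num_alt + 1, entries in [0, 1]
        for m in range(1, num_alt + 1):
            # With 0 <= bet <= 1 the factor is nonnegative, and under the stated
            # assumption its null expectation is at most 1.
            wealth[m - 1] *= 1.0 + bet * (probs[m] - probs[0])
        if sum(wealth) / num_alt >= threshold:
            leader = max(range(num_alt), key=wealth.__getitem__) + 1
            return t, leader
    return None, None


rng = random.Random(0)
alt_stream = (rng.gauss(MEANS[2], 1.0) for _ in range(10_000))
print(run_sequential_test(alt_stream, toy_classifier, num_alt=2, alpha=0.05))
```

The averaged wealth is used so that a single threshold $1/\alpha$ covers all $L$ alternatives at once, and the alternative with the largest wealth at the stopping time serves as a simple guess for the underlying distribution.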

Paper Structure (46 sections, 14 theorems, 97 equations, 6 figures, 5 tables)


Key Result

Theorem 3.2

Our proposed stopping rule $\tau$ satisfies the following: under a given separable classifier $g\in\mathcal{G}$ for a tuple $\mathbf{P}\in\mathcal{P}_{\text{sep}}$, where $\Delta_{\theta} := \max_{m \in \mathcal{L}} \boldsymbol{p}_\theta[m] - \boldsymbol{p}_{\theta}[0] = \boldsymbol{p}_\theta[\theta] - \boldsymbol{p}_{\theta}[0]$.

Figures (6)

  • Figure 1: Average expected stopping time vs. desired level-$\alpha$ values under $\theta=2$: In both experiments, we consider $10$ $\alpha$-levels, denoted by $\alpha[1],\ldots,\alpha[10]$. For each $\alpha[i]$, we run $300$ independent trials, where each trial corresponds to one run of the sequential test at level $\alpha[i]$. The observed average stopping time (over $300$ trials) for each $\alpha[i]$ is represented by the blue dots. The curve labeled "Quadratic Fit" is obtained by scaling the constant term of the theoretical bound on the expected stopping time to best match the empirical results. Details of the quadratic fit are described in Remark \ref{remark:quadratic_fit}.
  • Figure 2: The empirical ratio of $\hat{j}_\tau$ vs. $\alpha$ for Case 1 under $\theta=2$. We again consider the $10$ $\alpha$-values used in simulating the average stopping time earlier and run $300$ trials at each $\alpha$-value. The blue and orange dots depict $\{(\alpha[i],\texttt{ratio[1]}(\alpha[i]))\}_{i=1}^{10}$ and $\{(\alpha[i],\texttt{ratio[2]}(\alpha[i]))\}_{i=1}^{10}$, respectively. Note that we do not plot $\texttt{ratio[0]}(\alpha[i])$ since it is zero for any $i\in\{1,\ldots,10\}$.
  • Figure 3: We visualize the distribution shift in this figure. The orange and blue points represent training samples drawn from $P_0$ and $P_1$, respectively, while the red and green rectangles represent testing samples drawn from the mean-shifted distributions $\tilde{P}_0$ and $\tilde{P}_1$. The purple curve shows the decision boundary of the MLP classifier trained on samples drawn from $(P_0, P_1)$.
  • Figure 4: Average stopping time comparison, where both panels use the MLP classifier trained on the non-shifted tuple $\mathbf{P}$. We plot these two figures using the same $\alpha$ values and quadratic fitting method as in Figure \ref{fig:exp1}. However, the quadratic fit in Figure \ref{fig:exp2_toy_stop_time} represents the theoretical scaling trend derived from Table \ref{tab:exp2_toy_training}, whereas its counterpart in Figure \ref{fig:exp2_toy_shifted_stop_time} is based on Table \ref{tab:exp2_toy_testing}.
  • Figure 5: Average expected delay vs. $\alpha$. We consider $\alpha$ values ranging from $10^{-4}$ to $10^{-3}$, using NumPy's geomspace to generate $50$ logarithmically spaced levels. For each $\alpha$, we run $500$ independent trials and plot the average delay (blue dots). The orange dashed curve shows the corresponding quadratic fit, computed in the same manner as in Section \ref{sec:verify_thm_sht}, with the expected stopping time replaced by the expected detection delay. Note that $T_c$ in the figure corresponds to the change point $T$ used in this paper.
  • ...and 1 more figures
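The $\alpha$ grid described in the Figure 5 caption can be reproduced with a short sketch like the one below. The endpoints and the count of $50$ come from the caption; the variable name `alphas` is only illustrative.

```python
import numpy as np

# 50 levels between 1e-4 and 1e-3, evenly spaced on a log scale
# (both endpoints are included by default).
alphas = np.geomspace(1e-4, 1e-3, num=50)

# Consecutive levels differ by a constant multiplicative factor.
ratios = alphas[1:] / alphas[:-1]
print(alphas[0], alphas[-1], ratios[0])
```

Geometric spacing keeps the levels evenly spread in $\log(1/\alpha)$, which is the natural scale when plotting a stopping time or detection delay against $\alpha$.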

Theorems & Definitions (27)

  • Remark 2.1: Permutation
  • Remark 3.1: Practical implementation of our e-process
  • Theorem 3.2
  • Proposition 3.3
  • Corollary 3.4
  • Remark 3.5: Asymptotic optimality
  • Remark 3.6: Training-testing mismatch
  • Remark 3.7: Application to the sequential change-point detection problem
  • Remark 3.8
  • Proposition 3.9: Empirical attainability (Informal)
  • ...and 17 more