Global Sequential Testing for Multi-Stream Auditing

Beepul Bharti, Ambar Pal, Jeremias Sulam

TL;DR

This work constructs new sequential tests by merging test martingales, with different trade-offs in expected stopping time under sparse or dense alternative hypotheses. It also derives a new, balanced test whose expected stopping time bound matches Bonferroni's in the sparse setting and improves to $O\left(\frac{1}{k}\ln\frac{1}{\alpha}\right)$ under a dense alternative.

Abstract

Across many risk-sensitive areas, it is critical to continuously audit the performance of machine learning systems and detect any unusual behavior quickly. This can be modeled as a sequential hypothesis testing problem with $k$ incoming streams of data and a global null hypothesis that asserts that the system is working as expected across all $k$ streams. The standard global test employs a Bonferroni correction and has an expected stopping time bound of $O\left(\ln\frac{k}{\alpha}\right)$ when $k$ is large and the significance level of the test, $\alpha$, is small. In this work, we construct new sequential tests by using ideas of merging test martingales with different trade-offs in expected stopping times under different, sparse or dense alternative hypotheses. We further derive a new, balanced test that achieves an improved expected stopping time bound that matches Bonferroni's in the sparse setting but that naturally results in $O\left(\frac{1}{k}\ln\frac{1}{\alpha}\right)$ under a dense alternative. We empirically demonstrate the effectiveness of our proposed tests on synthetic and real-world data.
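The contrast between a Bonferroni-style global test and merged test martingales can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's construction: each stream's wealth is a fixed-fraction betting process $\prod_s (1 + \lambda x_s)$ for observations $x_s \in [-1, 1]$ with null mean zero, and the function names, the bet size `lam`, and the three merging rules (maximum against $k/\alpha$, average against $1/\alpha$, product against $1/\alpha$) are hypothetical stand-ins for standard constructions. The product rule additionally assumes the streams are independent.

```python
import math


def log_wealth_paths(x, lam=0.5):
    """Per-stream log-wealth after each round.

    x is a list of T rows, each holding k observations in [-1, 1].
    Wealth is prod_s (1 + lam * x_s), a fixed-fraction bet (illustrative choice).
    """
    running = [0.0] * len(x[0])
    paths = []
    for row in x:
        running = [lw + math.log1p(lam * v) for lw, v in zip(running, row)]
        paths.append(running)
    return paths


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def stopping_times(x, alpha=0.01, lam=0.5):
    """First round at which each global rule rejects (None if it never does)."""
    k = len(x[0])
    rules = {
        # Bonferroni: some single stream's wealth reaches k / alpha.
        "bonferroni": lambda lw: max(lw) >= math.log(k / alpha),
        # Average of the k wealth processes reaches 1 / alpha.
        "average": lambda lw: log_mean_exp(lw) >= math.log(1 / alpha),
        # Product of the k wealth processes reaches 1 / alpha
        # (assumes independent streams).
        "product": lambda lw: sum(lw) >= math.log(1 / alpha),
    }
    times = {name: None for name in rules}
    for t, lw in enumerate(log_wealth_paths(x, lam), start=1):
        for name, crossed in rules.items():
            if times[name] is None and crossed(lw):
                times[name] = t
    return times


# Toy dense alternative: 10 streams, every observation equal to 1.
dense = [[1.0] * 10 for _ in range(50)]
print(stopping_times(dense))
```

On this deterministic toy input the product rule stops at round 2, the average rule at round 12, and the Bonferroni rule at round 18. This mirrors the qualitative point that pooling evidence across streams helps when many streams deviate, but it is not a substitute for the paper's bounds or experiments.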

Paper Structure (31 sections, 22 theorems, 147 equations, 11 figures, 3 tables, 2 algorithms)

Key Result

Theorem 3.1 (Ville's Inequality)

Let $M$ be a nonnegative $P$-supermartingale with an initial value $M_0 \geq 0$. Then $\forall \alpha > 0$, $P(\exists t\geq 1: M_t \geq 1/\alpha) \leq \alpha{\mathbb{E}}_{P}[M_0]$.
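A quick Monte Carlo check can make the bound in Theorem 3.1 concrete. The sketch below is an illustration under stated assumptions, not part of the paper: it simulates a nonnegative martingale with $M_0 = 1$ by betting a fixed fraction `lam` on fair $\pm 1$ coin flips, and counts how often the path ever reaches $1/\alpha$ within a finite horizon. The function name, the bet size, the horizon, and the number of paths are all hypothetical choices.

```python
import random


def crossing_frequency(alpha=0.05, lam=0.5, horizon=200, n_paths=2000, seed=0):
    """Fraction of simulated paths whose wealth ever reaches 1 / alpha.

    Each path is M_t = prod_s (1 + lam * x_s) with x_s uniform on {-1, +1},
    so M is a nonnegative martingale started at M_0 = 1.
    """
    rng = random.Random(seed)
    threshold = 1.0 / alpha
    crossings = 0
    for _ in range(n_paths):
        wealth = 1.0  # M_0 = 1
        for _ in range(horizon):
            x = rng.choice((-1.0, 1.0))  # fair coin: mean zero
            wealth *= 1.0 + lam * x
            if wealth >= threshold:
                crossings += 1
                break  # first crossing is all that matters
    return crossings / n_paths


freq = crossing_frequency()
print(f"empirical crossing frequency: {freq:.4f} (Ville bound: 0.05)")
```

Since $\mathbb{E}_P[M_0] = 1$ here, the theorem bounds the crossing probability by $\alpha$. The empirical frequency should sit at or below that level up to simulation noise, which is why a test that rejects once wealth exceeds $1/\alpha$ has level $\alpha$.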

Figures (11)

  • Figure 1: Top: Distribution of stopping times, over 1,000 simulations, for various sequential tests across settings with varying proportions of streams with nonzero means. A test rejects when its corresponding wealth process exceeds $1/\alpha$ for $\alpha = 0.01$. The dashed vertical line is the empirical mean of the stopping times. Bottom: Trajectories of various wealth processes across settings with different amounts of nonzero means. Each line represents the median trajectory of a wealth process over 1,000 simulations, with shaded areas indicating the 25% and 75% quantiles. The y-axis is presented on a logarithmic scale. Wealth processes are clipped to $10^{-3}$ for visualization purposes.
  • Figure 2: Left plot of each figure: Distribution of stopping times, over 1,000 runs, for various sequential tests. A test rejects when its corresponding wealth process exceeds $1/\alpha$ for $\alpha = 0.01$. The dashed vertical line is the empirical mean of the stopping times. Right plot of each figure: Various wealth process trajectories. Each line represents the median trajectory of a wealth process over 1,000 runs, with shaded areas indicating the 25% and 75% quantiles.
  • Figure B.1: Top: Distribution of stopping times, over 1,000 simulations, for various sequential tests across settings with varying proportions of streams with nonzero means. A test rejects when its corresponding wealth process exceeds $1/\alpha$ for $\alpha = 0.01$. The dashed vertical line is the empirical mean of the stopping times. Bottom: Trajectories of various wealth processes across settings with different amounts of nonzero means. Each line represents the median trajectory of a wealth process over 1,000 simulations, with shaded areas indicating the 25% and 75% quantiles. The y-axis is presented on a logarithmic scale. Wealth processes are clipped to $10^{-3}$ for visualization purposes.
  • ...and 6 more figures

Theorems & Definitions (35)

  • Definition 3.1: Level-$\alpha$ Sequential Test
  • Definition 3.2: Stopping Time
  • Theorem 3.1: Ville's Inequality (Ville, 1939)
  • Proposition 4.1 (Chugg et al., 2023)
  • Theorem 4.1
  • Theorem 4.2: Stopping time of $\phi^{\textsf{ftrl}}$
  • Theorem 5.1: Stopping time of $\phi^{\textsf{prod}}$
  • Theorem 5.2: Stopping time of $\phi^{\textsf{ave}}$
  • Theorem 5.3: Stopping time of $\phi^{\textsf{balance}}$
  • proof
  • ...and 25 more