Neural network-based CUSUM for online change-point detection

Tingnan Gong, Junghwan Lee, Xiuyuan Cheng, Yao Xie

TL;DR

A neural network CUSUM (NN-CUSUM) is introduced for online change-point detection, together with a general theoretical condition under which trained neural networks can perform change-point detection and a characterization of which losses achieve this goal.

Abstract

Change-point detection, detecting an abrupt change in the data distribution from sequential data, is a fundamental problem in statistics and machine learning. CUSUM is a popular statistical method for online change-point detection due to its efficiency from recursive computation and constant memory requirement, and it enjoys statistical optimality. CUSUM requires knowing the precise pre- and post-change distributions. However, the post-change distribution is usually unknown a priori, since it represents an anomaly or novelty. Classic CUSUM can perform poorly when there is a model mismatch with actual data. While likelihood ratio-based methods face challenges with high-dimensional data, neural networks have become an emerging tool for change-point detection with computational efficiency and scalability. In this paper, we introduce a neural network CUSUM (NN-CUSUM) for online change-point detection. We also present a general theoretical condition under which the trained neural networks can perform change-point detection, and identify which losses can achieve our goal. We further extend our analysis by combining it with the Neural Tangent Kernel theory to establish learning guarantees for the standard performance metrics, including the average run length (ARL) and expected detection delay (EDD). The strong performance of NN-CUSUM is demonstrated in detecting change points in high-dimensional data using both synthetic and real-world data.
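
The recursive, constant-memory computation that the abstract attributes to classic CUSUM can be sketched as follows. This is a minimal illustration, not the paper's NN-CUSUM: it assumes the pre- and post-change distributions are known to be $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, 1)$, and the names `gaussian_llr` and `cusum_stopping_time`, the toy stream, and the threshold value are all hypothetical choices made for this example.

```python
def gaussian_llr(x, mu):
    """Log-likelihood ratio log f1(x)/f0(x), assuming f0 = N(0, 1) and f1 = N(mu, 1)."""
    return mu * x - 0.5 * mu * mu


def cusum_stopping_time(increments, threshold):
    """Run the recursion S_t = max(S_{t-1} + increment_t, 0).

    Returns the first 1-based time t at which S_t exceeds the threshold,
    or None if the statistic never crosses it. Only the scalar S_t is stored.
    """
    s = 0.0
    for t, inc in enumerate(increments, start=1):
        s = max(s + inc, 0.0)
        if s > threshold:
            return t
    return None


# Toy stream (assumed for illustration): three observations near 0,
# followed by observations near the assumed post-change mean mu = 2.
stream = [0.0, -1.0, 0.5, 2.0, 2.5, 2.0]
increments = [gaussian_llr(x, mu=2.0) for x in stream]
alarm_time = cusum_stopping_time(increments, threshold=4.0)
print(alarm_time)
```

On this toy stream the statistic stays at zero for the first three observations and crosses the threshold at the fifth one. When the post-change distribution is not known, the log-likelihood ratio above is unavailable, which is the gap the abstract says NN-CUSUM is meant to fill.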

Paper Structure

This paper contains 26 sections, 7 theorems, 59 equations, 9 figures, and 5 tables.

Key Result

Lemma 4.1

Suppose the trained neural network function $g_{\hat{\theta}}(x)$ belongs to a uniformly bounded family $\mathcal{G}$, such that $C := \sup_{g \in \mathcal{G}} \sup_{x \in \mathcal{X}} |g(x)| < \infty$, then for any $\lambda_1 > 0$, for $i = 0, 1$. The probability $\mathbb{P}_{i}$ is over the randomness of the testing split on the $t$-th window.

Figures (9)

  • Figure 1: Illustration of the NN-CUSUM procedure, running from $t$ (left) to $t+1$ (right). When the newest batch from the streaming data (treated with label $y_i = 1$) is received, it is split between the training stack and the testing stack (the oldest samples are purged from the stacks to keep a constant stack size). The neural network inherited from the previous time $t$ is updated using a stochastic gradient descent algorithm with one pass through the refreshed training stack. The updated neural network is then used to compute the test statistic on the refreshed testing stack. The data with label $y_i = 0$ (from reference samples) are handled in the same way, in parallel. The training and testing stack sizes are hyper-parameters to be tuned in practice to achieve good detection performance.
  • Figure 2: One sampled trajectory of the test statistic $S_t$ of NN-CUSUM for sequential detection of the Higgs boson [baldi2014searching]. The red dashed line denotes the change point, at which the signal changes from the background signal to the Higgs boson-producing signal. The blue dotted line denotes the approximate detection threshold; this shows that a threshold can be chosen to detect the change quickly after it has occurred.
  • Figure 3: The tail probability $\mathbb P_0\{\tau>t\}$ versus $t$. The data consist of 500 sequences of length 40000, generated i.i.d. from a 100-dimensional Gaussian distribution. On these data, we run NN-CUSUM to obtain the values of $\eta_t$. The threshold is $b = 1$. The solid curve shows numerical values from 500 Monte Carlo repetitions. The dashed curve represents the theoretical values $e^{-\lambda t}$ with $\lambda = 6\times 10^{-4}$, fitted by non-linear regression on the numerical values.
  • Figure 4: Numerically estimated ARL versus threshold $b$; the data are i.i.d. from the Gaussian mixture model $x_i\sim \frac{1}{2} \mathcal{N}(2 {\bf 1}, I_d) + \frac{1}{2} \mathcal{N}(-2 {\bf 1}, I_d)$, where ${\bf 1}$ denotes the all-ones vector and $I_d$ is the $d$-by-$d$ identity matrix (corresponding to the $f_0$ of the fourth example in Table \ref{tab:simulated-distributions}). The experiment consists of 400 sequences of length 500.
  • Figure 5: Pre- and post-change distributions of $\eta_t$ in the Gaussian sparse mean shift example.
  • ...and 4 more figures
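
The exponential tail fit described in the Figure 3 caption can be illustrated with a small sketch. It makes two assumptions that are not from the paper: the helper names `empirical_tail` and `fit_exponential_rate` are hypothetical, and the fit uses a simpler least-squares regression of $\log \mathbb{P}_0\{\tau > t\}$ on $t$ through the origin rather than the non-linear regression mentioned in the caption. The input below is a synthetic, exactly exponential tail with the rate quoted in the caption, not the paper's data.

```python
import math


def empirical_tail(stopping_times, grid):
    """Empirical estimate of P(tau > t) for each t in grid."""
    n = len(stopping_times)
    return [sum(1 for tau in stopping_times if tau > t) / n for t in grid]


def fit_exponential_rate(grid, tail):
    """Fit tail(t) ~ exp(-lam * t) by least squares on log tail(t) = -lam * t.

    The regression line is forced through the origin because P(tau > 0) = 1.
    Grid points with zero tail probability (or t <= 0) are skipped.
    """
    num = 0.0
    den = 0.0
    for t, p in zip(grid, tail):
        if p > 0.0 and t > 0:
            num += t * (-math.log(p))
            den += t * t
    if den == 0.0:
        raise ValueError("no usable grid points for the fit")
    return num / den


# Synthetic check: an exactly exponential tail with the rate from the caption.
grid = [100, 200, 300, 400]
exact_tail = [math.exp(-6e-4 * t) for t in grid]
rate = fit_exponential_rate(grid, exact_tail)
print(rate)
```

In practice one would pass Monte Carlo stopping times through `empirical_tail` first and then fit the resulting curve; the log-linear fit weights small tail probabilities differently from a direct non-linear fit, so the two estimates of $\lambda$ need not coincide on noisy data.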

Theorems & Definitions (15)

  • Lemma 4.1: Concentration on test samples
  • Lemma 4.2: Guarantee under MMD loss
  • Lemma 4.3: Wald's identity
  • Lemma 4.4: Finite expected stopping time for general CUSUM with i.i.d. increments
  • Theorem 4.5: EDD bound for NN-CUSUM, i.i.d. increments
  • Remark 4.1
  • Remark 4.2: ARL approximation for NN-CUSUM, i.i.d. increments
  • Lemma 4.6: Generalized Wald's identity for $m$-dependent sequence
  • Theorem 4.7: EDD upper bound under $m$-dependent stationary process
  • Remark 4.3: Effect of window length $w$
  • ...and 5 more
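
For orientation, the classical form of Wald's identity is recalled below. This is the textbook statement, given as an assumption about what Lemma 4.3 refers to; the paper's own lemma may be stated under different conditions or notation. Here $X_1, X_2, \dots$ are i.i.d. with finite mean and $\tau$ is a stopping time with $\mathbb{E}[\tau] < \infty$.

```latex
\mathbb{E}\left[ \sum_{i=1}^{\tau} X_i \right] = \mathbb{E}[\tau] \, \mathbb{E}[X_1].
```

As a standard consequence, if $\mathbb{E}[X_1] > 0$ and $\tau$ is the first time the partial sum reaches a threshold $b > 0$, then the stopped sum is at least $b$, so $\mathbb{E}[\tau] \ge b / \mathbb{E}[X_1]$. This indicates why results on expected stopping times are typically expressed through the ratio of the threshold to the mean increment.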