
Reproduction of scan B-statistic for kernel change-point detection algorithm

Zihan Wang

TL;DR

The paper tackles online change-point detection with a distribution-free approach by reproducing and evaluating the kernel-based scan B-statistic (SBSK). It builds on the maximum mean discrepancy (MMD) framework, using a B-test over reference blocks to form a standardized online statistic $Z_{B_0,t}'$ and a stopping rule that raises an alarm the first time $Z_{B_0,t}' > b$, with the threshold $b$ derived from an approximation to the average run length (ARL). By comparing SBSK to Hotelling’s $T^2$ and the generalized likelihood ratio (GLR) statistic across diverse change scenarios, the study demonstrates that SBSK yields consistently superior detection performance, particularly in non-Gaussian settings, and shows that subsampling can modestly improve variance estimation and detection. The findings support the practical viability of SBSK for robust, online change-point detection in real-world data streams.
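
The procedure summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`gaussian_gram`, `mmd2_unbiased`, `scan_b`, `null_std`, `detect`) are hypothetical, a Gaussian kernel is assumed, the bandwidth and threshold values are arbitrary, and the standardization uses a simple Monte Carlo estimate of the null standard deviation on reference data in place of the paper's variance estimate and ARL-based threshold.

```python
import numpy as np


def gaussian_gram(a, b, sigma):
    """Gaussian-kernel Gram matrix between rows of a (m, d) and b (n, d)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2_unbiased(x, y, sigma):
    """Unbiased MMD^2 estimate between two blocks of equal size B0."""
    b0 = x.shape[0]
    kxy = gaussian_gram(x, y, sigma)
    # h[l, j] = k(x_l, x_j) + k(y_l, y_j) - k(x_l, y_j) - k(x_j, y_l)
    h = gaussian_gram(x, x, sigma) + gaussian_gram(y, y, sigma) - kxy - kxy.T
    np.fill_diagonal(h, 0.0)  # the estimator sums over l != j only
    return h.sum() / (b0 * (b0 - 1))


def scan_b(reference, block, n_blocks, sigma, rng):
    """Average MMD^2 between the test block and n_blocks random reference blocks."""
    b0 = block.shape[0]
    vals = []
    for _ in range(n_blocks):
        idx = rng.choice(len(reference), size=b0, replace=False)
        vals.append(mmd2_unbiased(reference[idx], block, sigma))
    return float(np.mean(vals))


def null_std(reference, b0, n_blocks, sigma, rng, n_rep=200):
    """ASSUMPTION: Monte Carlo null standard deviation from reference data only.

    This stands in for the paper's variance estimate of the statistic.
    """
    zs = []
    for _ in range(n_rep):
        idx = rng.choice(len(reference), size=b0, replace=False)
        mask = np.ones(len(reference), dtype=bool)
        mask[idx] = False  # hold the pseudo test block out of the reference pool
        zs.append(scan_b(reference[mask], reference[idx], n_blocks, sigma, rng))
    return float(np.std(zs, ddof=1))


def detect(reference, stream, b0, n_blocks, sigma, threshold, rng):
    """Return the first time t at which the standardized statistic exceeds threshold."""
    sd = null_std(reference, b0, n_blocks, sigma, rng)
    for t in range(b0, len(stream) + 1):
        window = stream[t - b0:t]  # the most recent b0 samples
        if scan_b(reference, window, n_blocks, sigma, rng) / sd > threshold:
            return t
    return None  # no alarm raised within the stream


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(500, 10))
    # Illustrative stream: 50 pre-change samples, then a mean shift in all coordinates.
    stream = np.vstack([rng.normal(size=(50, 10)), rng.normal(loc=1.0, size=(50, 10))])
    print(detect(ref, stream, b0=20, n_blocks=5, sigma=4.0, threshold=8.0, rng=rng))
```

Redrawing the reference blocks at every time step keeps the sketch short. A practical implementation would more likely fix or cache them and calibrate the threshold to a target ARL rather than pick it by hand.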

Abstract

Change-point detection has garnered significant attention due to its broad range of applications, including epidemic disease outbreaks, social network evolution, image analysis, and wireless communications. In an online setting, where new data samples arrive sequentially, it is crucial to continuously test whether these samples originate from a different distribution. Ideally, the detection algorithm should be distribution-free to ensure robustness in real-world applications. In this paper, we reproduce a recently proposed online change-point detection algorithm based on an efficient kernel-based scan B-statistic, and compare its performance with two commonly used parametric statistics. Our numerical experiments demonstrate that the scan B-statistic consistently delivers superior performance. In more challenging scenarios, parametric methods may fail to detect changes, whereas the scan B-statistic successfully identifies them in a timely manner. Additionally, the use of subsampling techniques offers a modest improvement to the original algorithm.

Paper Structure

This paper contains 5 sections, 9 equations, and 4 figures.

Figures (4)

  • Figure 1: Detection delay of each statistic in each change scenario. The block size is $B_0 = 20$, and thresholds for all methods are calibrated so that $\mathrm{ARL} = 5000$. A missing boxplot means that the procedure fails to detect the change, i.e., the expected detection delay (EDD) exceeds 50.
  • Figure 2: Comparison of EDD for different numbers of reference blocks $N$ (Panel A) and different block sizes $B_0$ (Panel B).
  • Figure 3: Comparison of EDD using the Gaussian kernel (Panel A) and the Laplacian kernel (Panel B) with different bandwidths $\sigma$. Here, the distribution shifts from $N(\mathbf{0}, I_{10})$ to $N(\mu\mathbf{1}, I_{10})$.
  • Figure 4: Comparison of completely random subsampling against subsampling techniques for estimating the variance of $Z_B$. Given a range of target ARL values, thresholds determined from Theorem 2 are shown in Panel A. In fact, $n = O\{(\log \mathrm{ARL})^{1/2}\}$.