A Neighbor-Searching Discrepancy-based Drift Detection Scheme for Learning Evolving Data

Feng Gu, Jie Lu, Zhen Fang, Kun Wang, Guangquan Zhang

TL;DR

The paper tackles real concept drift in streaming data by distinguishing boundary-changing real drift from distribution-shaping virtual drift. It introduces Neighbor-Searching Discrepancy (NSD), a distribution-free statistic based on $k$-nearest-neighbor search, and derives its connection to Beta and Gamma distributions via the neighbor-searching volume framework. NSD enables not only detection of real drift with high power but also inference of drift direction (invasion or retreat) through the evolution of a classification gap, without resampling. Empirical results across synthetic and real-world datasets show NSD’s robustness to distribution and dimensionality, and the NSD-based drift detector often outperforms state-of-the-art methods while offering substantial computational efficiency.

Abstract

Uncertain changes in data streams make it challenging for machine learning models to adapt dynamically and maintain performance in real time. In particular, classification boundary change, also known as real concept drift, is the major cause of classification performance deterioration. However, accurately detecting real concept drift remains challenging because the two theoretical foundations of existing drift detection methods, two-sample distribution tests and classification error rate monitoring, both suffer from inherent limitations: an inability to distinguish virtual drift (changes that do not affect the classification boundary and therefore trigger unnecessary model maintenance), limited statistical power, or high computational cost. Furthermore, no existing detection method provides information on the trend of the drift, which could be invaluable for model maintenance. This work presents a novel real concept drift detection method based on Neighbor-Searching Discrepancy, a new statistic that measures the classification boundary difference between two samples. The proposed method detects real concept drift with high accuracy while ignoring virtual drift. It can also indicate the direction of the classification boundary change by identifying the invasion or retreat of a certain class, which in turn indicates a change in separability between classes. A comprehensive evaluation comprising 11 experiments is conducted, including empirical verification of the proposed theory on artificial datasets and experimental comparisons with commonly used drift handling methods on real-world datasets. The results show that the proposed theory is robust across a range of distributions and dimensions, and that the drift detection method outperforms state-of-the-art alternatives.

Paper Structure

This paper contains 13 sections, 9 theorems, 41 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

For a homogeneous b.p.p., given a random variable $V(k)\sim\mathcal{V}(k,n,\lambda)$, if $n$ is much greater than $k$ (denoted $n\gg k$), then $V(k)\sim\mathrm{Gamma}(k,\lambda)$.
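
The large-$n$ behaviour stated in Lemma 1 can be checked numerically with a small simulation. The sketch below is illustrative only and is not the paper's code: it assumes a one-dimensional setting in which $n$ points are uniform on $[0, n/\lambda]$, the search grows outward from the origin so that the searched volume at the $k$th neighbor is simply the $k$th smallest coordinate, and $\mathrm{Gamma}(k,\lambda)$ is read as shape $k$ and rate $\lambda$. The function name `simulate_volumes` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_volumes(k, n, lam, trials=5000, seed=0):
    """Draw kth-neighbor search volumes in an assumed 1-D point process.

    Assumption: n points are uniform on [0, n / lam], so the intensity is lam,
    and the search region [0, v] grows from the origin until it contains
    k points. The volume at that moment is the kth smallest coordinate.
    """
    rng = np.random.default_rng(seed)
    length = n / lam
    points = rng.uniform(0.0, length, size=(trials, n))
    # np.partition puts the kth smallest value of each row at index k - 1.
    return np.partition(points, k - 1, axis=1)[:, k - 1]

k, n, lam = 3, 500, 2.0          # n is much greater than k
volumes = simulate_volumes(k, n, lam)

# Moments of a Gamma distribution with shape k and rate lam (assumed reading).
gamma_mean = k / lam
gamma_var = k / lam**2

print("empirical mean:", volumes.mean(), "Gamma mean:", gamma_mean)
print("empirical var: ", volumes.var(), "Gamma var: ", gamma_var)
```

With $n \gg k$ the empirical mean and variance should sit close to the Gamma values; shrinking $n$ toward $k$ in the same script makes the gap visible, which is the qualitative behaviour the lemma describes.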

Figures (6)

  • Figure 1: Examples of finding 3 nearest neighbors following neighbor searching of different manifold spaces.
  • Figure 2: The PDF of $\mathcal{V}(k,n,\lambda)$ converges to that of $\mathrm{Gamma}(k,\lambda)$ as $n$ becomes much greater than $k$.
  • Figure 3: Neighbor-searching discrepancy equals the area under the Beta PDF over $[0, 0.5]$.
  • Figure 4: Different types of real concept drift and virtual drift reflecting on classification gap change.
  • Figure 5: Neighbor searching is constructed by combining simple spherical searches on sample X1; the invasion and retreat of the classification gap are then tested on sample X2.
  • ...and 1 more figure
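
The area described in the Figure 3 caption is a Beta CDF evaluated at 0.5. The sketch below shows one way to compute such an area; it is an illustration, not the paper's procedure. The shape parameters `a` and `b` are placeholders, since this summary does not state which parameters the paper uses, and the helper `beta_cdf_int` is a hypothetical name. It relies on the standard identity that, for positive integers $a$ and $b$, the regularized incomplete beta function $I_x(a,b)$ equals the probability that a Binomial$(a+b-1, x)$ variable is at least $a$.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """Area under the Beta(a, b) PDF over [0, x] for positive integers a, b.

    Uses the binomial-tail identity I_x(a, b) = P(Bin(a + b - 1, x) >= a),
    which avoids numerical integration.
    """
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

# Placeholder shape parameters, chosen only for illustration.
area_symmetric = beta_cdf_int(0.5, 4, 4)  # symmetric density: half the mass lies below 0.5
area_skewed = beta_cdf_int(0.5, 2, 5)     # mass concentrated below 0.5

print(area_symmetric, area_skewed)
```

A symmetric Beta density puts exactly half its mass on $[0, 0.5]$, so areas far from one half correspond to strongly asymmetric shape parameters.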

Theorems & Definitions (33)

  • Definition 1: Neighbor Searching
  • Definition 2: $k$th Nearest Neighbor
  • Remark
  • proof
  • Definition 3: Neighbor-searching Volume Distribution
  • Lemma 1
  • proof
  • Definition 4: Neighbor-searching Volume Ratio
  • Lemma 2
  • proof
  • ...and 23 more