Synthetic Non-stationary Data Streams for Recognition of the Unknown

Joanna Komorniczak

TL;DR

This paper addresses non-stationarity in data streams by jointly modeling concept drift in known classes and the appearance of unknown classes. It introduces the Open World Data Stream Generator with Concept Non-stationarity (owdsg), a Madelon-based tool implemented in Python that creates synthetic streams with controllable drifts, novel-class emergence, class imbalance, and optional dimensionality reduction via random projections. Two experiments demonstrate the generator's utility: unsupervised drift and novelty detection across detector sensitivities, and open set recognition (OSR) with incremental training. The work provides a practical framework for benchmarking detectors and incremental learners in open-world, non-stationary environments.

Abstract

The problem of data non-stationarity is commonly addressed in data stream processing. In a dynamic environment, methods should always be ready to analyze time-varying data; hence, they should enable incremental training and respond to concept drifts. An equally important variability typical of non-stationary data stream environments is the emergence of new, previously unknown classes. Methods often focus on only one of these two phenomena, detection of concept drifts or detection of novel classes, although both can be observed in data streams. Additionally, with respect to previously unknown observations, open set recognition has become particularly important in recent years; there, the goal is to classify efficiently within known classes and to recognize objects outside the model's competence. This article presents a strategy for synthetic data stream generation in which both concept drifts and the emergence of new classes representing unknown objects occur. The presented research shows how unsupervised drift detectors address the task of detecting novelty and concept drifts and demonstrates how the generated data streams can be utilized in the open set recognition task.
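
To make the kind of stream described here concrete, the following is a minimal sketch, not the owdsg implementation or its API. The function `make_stream` and all of its parameters are hypothetical names chosen for illustration. Gaussian clusters stand in for the Madelon-style generation, the drift is a single sudden shift of the known-class centres, and class imbalance is left out for brevity.

```python
import numpy as np


def make_stream(n_chunks=20, chunk_size=100, n_features=8, n_known=2,
                n_unknown=2, drift_chunk=10, out_dim=None, seed=0):
    """Yield (X, y, y_open) chunks of a toy non-stationary stream.

    Hypothetical helper for illustration only. Known classes undergo one
    sudden concept drift; unknown classes emerge at evenly spaced chunks.
    """
    rng = np.random.default_rng(seed)
    n_classes = n_known + n_unknown

    # One cluster centre per class, before and after the drift.
    centres_before = rng.normal(0.0, 3.0, size=(n_classes, n_features))
    centres_after = centres_before.copy()
    # Concept drift affects only the known classes.
    centres_after[:n_known] += rng.normal(0.0, 2.0, size=(n_known, n_features))

    # Chunk index at which each unknown class first appears.
    first_seen = np.linspace(0, n_chunks, n_unknown + 2)[1:-1].astype(int)

    # Optional random projection to the requested dimensionality.
    proj = None if out_dim is None else rng.normal(size=(n_features, out_dim))

    for t in range(n_chunks):
        centres = centres_after if t >= drift_chunk else centres_before
        # Classes present in chunk t: all known plus already-emerged unknown.
        active = list(range(n_known)) + [
            n_known + k for k in range(n_unknown) if t >= first_seen[k]
        ]
        y = rng.choice(active, size=chunk_size)
        X = centres[y] + rng.normal(size=(chunk_size, n_features))
        if proj is not None:
            X = X @ proj
        # Open-set view: all unknown classes share the common label -1.
        y_open = np.where(y < n_known, y, -1)
        yield X, y, y_open


if __name__ == "__main__":
    for t, (X, y, y_open) in enumerate(make_stream(out_dim=2)):
        print(t, X.shape, sorted(set(y_open.tolist())))
```

Each chunk carries both the original labels `y` and the collapsed labels `y_open`, so the same stream can feed a novelty detector (which needs the emergence points) and an open set recognition model (which only needs known versus unknown).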

Paper Structure

This paper contains 19 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Drift detection moments signaled by supervised (red) and unsupervised (blue) drift detectors in a non-stationary data stream with concept drifts in known classes (marked as $D$) and the emergence of unknown classes (marked as $N$).
  • Figure 2: The scheme of the Open World Data Stream Generator with Concept Non-stationarity, which employs the Madelon static generation method (blue blocks) and organizes samples into a data stream with the requested dimensionality (red blocks).
  • Figure 3: Exemplary planar data streams with a single concept drift and two novel classes. The top row shows the original labels of the unknown classes, while the bottom row uses a common label for all unknown classes.
  • Figure 4: The class proportions in the generated streams in the context of novelty ground truth (evenly distributed on the left and random on the right side) and known-class imbalance (balanced on the left and imbalanced on the right side).
  • Figure 5: The overall score confusion matrices in specific chunks of the stream (top) and the plot presenting the metric value across the entire stream with a single concept drift and four novel classes.
  • ...and 2 more figures