Synthetic Non-stationary Data Streams for Recognition of the Unknown
Joanna Komorniczak
TL;DR
This paper addresses non-stationarity in data streams by jointly modeling concept drift in known classes and the appearance of unknown classes. It introduces the Open World Data Stream Generator with Concept Non-stationarity (owdsg), a Madelon-based tool implemented in Python that creates synthetic streams with controllable drifts, novel-class emergence, class imbalance, and optional dimensionality reduction via random projections. Two experiments demonstrate the generator's utility for evaluating streaming methods: unsupervised drift and novelty detection across detector sensitivities, and open set recognition (OSR) with incremental training. The work provides a practical framework for benchmarking detectors and incremental learners in open-world, non-stationary environments, with implications for robust streaming analytics and method development.
Abstract
The problem of data non-stationarity is commonly addressed in data stream processing. In a dynamic environment, methods must remain ready to analyze time-varying data; hence, they should support incremental training and respond to concept drifts. An equally important source of variability in non-stationary data streams is the emergence of new, previously unknown classes. Methods often focus on only one of these two phenomena, either the detection of concept drifts or the detection of novel classes, although both can occur in the same data stream. Additionally, with respect to previously unknown observations, open set recognition has become particularly important in recent years; there, the goal is to classify efficiently within the known classes while recognizing objects outside the model's competence. This article presents a strategy for generating synthetic data streams in which both concept drifts and the emergence of new classes representing unknown objects occur. The presented research shows how unsupervised drift detectors handle the task of detecting novelty and concept drifts, and demonstrates how the generated data streams can be used in the open set recognition task.
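To make the setting concrete, the following is a minimal sketch of a chunked stream in which one known class drifts and a new class later emerges. It is not the owdsg implementation or its API: it swaps the Madelon-based generation for simple Gaussian blobs, and the function name `generate_stream`, its parameters, and the centroid values are all hypothetical choices made for illustration.

```python
import random


def generate_stream(n_chunks=10, chunk_size=100, n_features=2,
                    drift_chunk=4, novelty_chunk=7, seed=0):
    """Yield (chunk_index, X, y) for a toy non-stationary stream.

    Assumption: classes are isotropic Gaussian blobs, not Madelon clusters.
    """
    rng = random.Random(seed)
    # Two known classes with illustrative starting centroids.
    centroids = {0: [0.0] * n_features, 1: [3.0] * n_features}
    for k in range(n_chunks):
        if k == drift_chunk:
            # Sudden concept drift: class 1 moves to a new region.
            centroids[1] = [-3.0] * n_features
        if k == novelty_chunk:
            # Novel class emerges: label 2 has never appeared before.
            centroids[2] = [6.0] * n_features
        labels = sorted(centroids)
        X, y = [], []
        for _ in range(chunk_size):
            label = rng.choice(labels)
            # Sample each feature around the class centroid with unit variance.
            X.append([rng.gauss(c, 1.0) for c in centroids[label]])
            y.append(label)
        yield k, X, y


stream = list(generate_stream())
```

In this sketch, a drift detector would be expected to react around chunk 4, while an open set recognition model trained on the early chunks would need to flag label 2 as unknown from chunk 7 onward. A sudden drift and a single novel class are the simplest case; gradual drifts and class imbalance would require additional parameters.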
