Describing Nonstationary Data Streams in Frequency Domain
Joanna Komorniczak
TL;DR
Concept drift in nonstationary data streams is a challenge for drift detectors that rely on metadescriptions. We present the Frequency Filtering Metadescriptor (ffm), a post-hoc unsupervised method. Each chunk is represented in the frequency domain by averaging its samples after the transform $\mathcal{F}$ and retaining the first $d/2$ real components; ffm then extracts the $n$ frequency components with the largest variance across chunks. The final metadescription $R$ can be clustered into $c$ concepts with $k$-means and visualized through an inverse transform, enabling concept identification and drift explanation in high-dimensional streams. Across synthetic and real-world streams, ffm is competitive with PCA and state-of-the-art metadescriptions. It offers interpretable frequency-based features and the ability to infer the number of concepts, with potential extensions to incremental processing and reconstruction of the original features.
Abstract
Concept drift is among the primary challenges faced by data stream processing methods. Drift detection strategies, designed to counteract the negative consequences of such changes, often rely on analyzing the metafeatures of the problem. This work presents the Frequency Filtering Metadescriptor, a tool for characterizing a data stream that searches for informative frequency components visible in the feature vectors of samples. The frequencies are filtered according to their variance across all available data batches. The presented solution can generate a metadescription of the data stream, use it to separate chunks into groups describing specific concepts, and visualize the selected frequencies in the original spatial domain. The experimental analysis compared the proposed solution with two state-of-the-art strategies and with a PCA baseline in the post-hoc concept identification task, and was followed by the identification of concepts in real-world data streams. The generalization in the frequency domain adopted in the proposed solution makes it possible to capture complex feature dependencies with a reduced number of frequency components while maintaining the semantic meaning of the data.
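
The pipeline summarized above can be illustrated with a minimal NumPy sketch. It is not the reference implementation, and it rests on several assumptions that the text does not confirm: $\mathcal{F}$ is taken to be a one-dimensional discrete Fourier transform along the feature axis of each sample, "real components" is read as the real parts of the first $d/2$ coefficients, and the inverse-transform visualization is approximated by zeroing all unselected coefficients. The function names `ffm_describe`, `kmeans`, and `ffm_visualize` are hypothetical, and the small `kmeans` helper is only a stand-in for any standard $k$-means implementation.

```python
import numpy as np


def ffm_describe(chunks, n):
    """Sketch of a frequency-filtering metadescription (assumed reading).

    chunks: array of shape (n_chunks, chunk_size, d).
    Returns R with shape (n_chunks, n) and the selected component indices.
    """
    chunks = np.asarray(chunks, dtype=float)
    d = chunks.shape[-1]
    # Assumption: 1-D FFT over the d features of each sample,
    # then an average over the samples of every chunk.
    spectrum = np.fft.fft(chunks, axis=-1).mean(axis=1)  # (n_chunks, d)
    # Assumption: keep the real part of the first d/2 coefficients.
    half = spectrum[:, : d // 2].real
    # Keep the n components whose value varies most across chunks.
    order = np.argsort(half.var(axis=0))[::-1]
    selected = np.sort(order[:n])
    return half[:, selected], selected


def kmeans(R, c, n_iter=50):
    """Minimal Lloyd's k-means with farthest-point initialization (stand-in)."""
    centers = [R[0]]
    for _ in range(1, c):
        dist = np.min([np.linalg.norm(R - m, axis=1) for m in centers], axis=0)
        centers.append(R[dist.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(R), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(R[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(c):
            if np.any(labels == j):
                centers[j] = R[labels == j].mean(axis=0)
    return labels


def ffm_visualize(r, selected, d):
    """Map one metadescription row back to the d-dimensional feature space.

    Assumption: unselected coefficients are set to zero before the inverse FFT.
    """
    spectrum = np.zeros(d)
    spectrum[selected] = r
    return np.fft.ifft(spectrum).real
```

Under these assumptions, `R, selected = ffm_describe(chunks, n)` yields one row per chunk, `kmeans(R, c)` assigns each chunk to one of $c$ concepts, and `ffm_visualize(R[i], selected, d)` returns a $d$-dimensional pattern that can be plotted against the original features.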
