Describing Nonstationary Data Streams in Frequency Domain

Joanna Komorniczak

TL;DR

Concept drift in nonstationary data streams is a challenge for drift detectors that rely on metadescriptions. We present the Frequency Filtering Metadescriptor (ffm), a post-hoc unsupervised method. For each chunk, samples are transformed to the frequency domain via $\mathcal{F}$ and averaged, and the first $d/2$ real components are retained; ffm then extracts the $n$ frequency components with the largest variance across chunks. The final metadescription $R$ can be clustered into $c$ concepts with $k$-means and visualized by an inverse transform, enabling concept identification and drift explanation in high-dimensional streams. Across synthetic and real-world streams, ffm is competitive with PCA and state-of-the-art metadescriptions, offering interpretable frequency-based features and the ability to infer the number of concepts, with potential extensions to incremental processing and reconstruction of the original features.

Abstract

Concept drift is among the primary challenges faced by data stream processing methods. Drift detection strategies, designed to counteract the negative consequences of such changes, often rely on analyzing the problem's metafeatures. This work presents the Frequency Filtering Metadescriptor, a tool for characterizing a data stream that searches for informative frequency components visible in the samples' feature vectors. The frequencies are filtered according to their variance across all available data batches. The presented solution can generate a metadescription of the data stream, use it to separate chunks into groups describing specific concepts, and visualize the frequencies in the original spatial domain. The experimental analysis compares the proposed solution with two state-of-the-art strategies and with a PCA baseline in the post-hoc concept identification task. It is followed by the identification of concepts in real-world data streams. The generalization in the frequency domain adopted in the proposed solution makes it possible to capture complex feature dependencies as a reduced number of frequency components while maintaining the semantic meaning of the data.
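
The filtering step described above can be illustrated with a minimal sketch. It is not the author's implementation. It assumes that $\mathcal{F}$ is the discrete Fourier transform applied along each sample's feature vector, and that "real components" means the real parts of the first $d/2$ coefficients. The function names `dft_real_half`, `chunk_representation`, and `ffm_sketch` are hypothetical. A naive DFT is used only to keep the example dependency-free; in practice a library routine such as `numpy.fft.fft` would be the usual choice. Clustering with $k$-means and the inverse transform are left out.

```python
import cmath
from statistics import pvariance


def dft_real_half(x):
    """Real parts of the first len(x)//2 DFT coefficients of one feature vector.

    Assumption: this is how 'the first d/2 real components' is read here.
    """
    d = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / d) for t in range(d)).real
        for k in range(d // 2)
    ]


def chunk_representation(chunk):
    """Average the per-sample frequency representations within one chunk."""
    spectra = [dft_real_half(sample) for sample in chunk]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def ffm_sketch(chunks, n):
    """Keep the n frequency components that vary most across chunks.

    Returns the metadescription R (one row per chunk, n columns) and the
    indices of the selected frequency components in ascending order.
    """
    reps = [chunk_representation(chunk) for chunk in chunks]
    variances = [pvariance(column) for column in zip(*reps)]
    ranked = sorted(range(len(variances)), key=lambda k: variances[k], reverse=True)
    selected = sorted(ranked[:n])
    R = [[rep[k] for k in selected] for rep in reps]
    return R, selected


if __name__ == "__main__":
    s = 0.5 ** 0.5
    # Toy stream with d = 8 features: two chunks dominated by frequency 1,
    # followed by two chunks dominated by frequency 2.
    concept_a = [1.0, s, 0.0, -s, -1.0, -s, 0.0, s]
    concept_b = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
    stream = [[concept_a] * 3, [concept_a] * 3, [concept_b] * 3, [concept_b] * 3]
    R, selected = ffm_sketch(stream, n=2)
    print(selected)
    for row in R:
        print([round(value, 6) for value in row])
```

In this toy stream only components 1 and 2 change between chunks, so they are the ones retained, and the rows of `R` separate the two concepts. Each row of `R` could then be passed to any $k$-means implementation to group chunks into $c$ concepts.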

Paper Structure

This paper contains 19 sections, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: The frequency representation of the data stream, clustered into four concepts. The colors identify the clusters obtained with the k-means algorithm.
  • Figure 2: The visual representation of data chunks, generated using $n=16$ frequency components. The first row presents the chunks from the first part of the stream, and the second row presents the chunks from the following part. A gradual drift was injected into the presented stream, resulting in a smooth transition between concepts.
  • Figure 3: The relation between normalized mutual information and the value of the $n$ hyperparameter for various chunk sizes (columns) and various numbers of drifts (line colors). The x-axis shows the value of the $n$ hyperparameter.
  • Figure 4: The results of the second experiment across four different metrics (in columns) and for the three considered types of drift (in rows). The color of each bar depends on the obtained metric value; higher values are closer to red.
  • Figure 5: The results of the experiment in the form of a heatmap with interpolated values. The background color describes the average silhouette score of the clustering task. The horizontal axis shows the true number of concepts in the stream, and the vertical axis shows the considered number of concepts. The red point identifies the number of concepts with the highest score for each processed stream type.
  • ...and 2 more figures