
MOStream: A Modular and Self-Optimizing Data Stream Clustering Algorithm

Zhengru Wang, Xin Wang, Shuhao Zhang

TL;DR

Data stream clustering (DSC) must rapidly adapt to high-velocity data and evolving clusters. MOStream introduces a modular, self-optimizing framework that dynamically reconfigures four design aspects (summarizing data structure, window model, outlier detection, and refinement strategy) based on regularly detected stream characteristics and user objectives. Through automatic design choice selection and flexible algorithm migration, MOStream outperforms nine DSC algorithms on four real-world and three synthetic datasets in both clustering accuracy and throughput. This work offers a practical, robust approach to adaptive DSC in diverse, time-sensitive applications.

Abstract

Data stream clustering is a critical operation in various real-world applications, ranging from the Internet of Things (IoT) to social media and financial systems. Existing data stream clustering algorithms, while effective to varying extents, often lack the flexibility and self-optimization capabilities needed to adapt to diverse workload characteristics such as outliers, cluster evolution, and changing dimensionality of data points. These limitations manifest in suboptimal clustering accuracy and computational inefficiency. In this paper, we introduce MOStream, a modular and self-optimizing data stream clustering algorithm designed to dynamically balance clustering accuracy and computational efficiency at runtime. MOStream distinguishes itself by its adaptivity, clearly demarcating four pivotal design dimensions: the summarizing data structure, the window model for handling data temporality, the outlier detection mechanism, and the refinement strategy for improving cluster quality. This clear separation lets design choices be swapped independently and makes MOStream adaptable to a wide array of application contexts. We conduct a rigorous performance evaluation of MOStream, employing diverse configurations and benchmarking it against 9 representative data stream clustering algorithms on 4 real-world datasets and 3 synthetic datasets. Our empirical results demonstrate that MOStream consistently surpasses competing algorithms in terms of clustering accuracy, processing throughput, and adaptability to varying data stream characteristics.
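
The separation into four design dimensions can be pictured with a small sketch. This is not MOStream's implementation: the class names, option strings (such as "landmark", "damped", "buffered"), thresholds, and selection rules below are all hypothetical placeholders chosen for illustration, and the abstract does not specify any of them. The sketch only shows how a configuration with four independent slots might be re-selected from observed stream statistics and a user objective.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DesignChoice:
    """One option per design dimension (option names are illustrative only)."""
    summary_structure: str
    window_model: str
    outlier_detection: str
    refinement: str


@dataclass(frozen=True)
class StreamStats:
    """Hypothetical statistics measured over a recent batch of the stream."""
    outlier_rate: float    # fraction of recent points that look like outliers
    evolution_rate: float  # fraction of recent batches in which clusters changed
    dimensionality: int    # number of attributes per data point


def select_design(current: DesignChoice, stats: StreamStats, objective: str) -> DesignChoice:
    """Toy rule-based selector; every threshold here is an arbitrary assumption."""
    return replace(
        current,
        # Assumption: switch structures when the data become high-dimensional.
        summary_structure="clustering_feature" if stats.dimensionality > 50 else current.summary_structure,
        # Assumption: forget old data faster when clusters evolve often.
        window_model="damped" if stats.evolution_rate > 0.2 else "landmark",
        # Assumption: enable outlier handling only when outliers are frequent.
        outlier_detection="buffered" if stats.outlier_rate > 0.05 else "none",
        # Assumption: refinement costs throughput, so enable it only for accuracy.
        refinement="incremental" if objective == "accuracy" else "none",
    )


default = DesignChoice("grid", "landmark", "none", "none")
noisy = StreamStats(outlier_rate=0.10, evolution_rate=0.50, dimensionality=10)
print(select_design(default, noisy, objective="accuracy"))
```

Because each dimension is a separate field, changing one choice (for example the window model) leaves the other three untouched, which is the kind of independence the abstract attributes to the modular design.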

Paper Structure

This paper contains 21 sections, 7 figures, 3 tables, and 4 algorithms.

Figures (7)

  • Figure 1: Performance comparison of nine representative DSC algorithms and MOStream on KDD99, a dataset with a high frequency of outlier evolution and an increasing frequency of cluster evolution.
  • Figure 2: Performance comparison on four real-world workloads. The throughput of DBStream on the Insects dataset and the throughput of SL-KMeans on the KDD99 dataset are much lower than those of the other baselines and are therefore omitted from the figure.
  • Figure 3: Performance comparison on EDS (panels (a), (b)) and ODS (panels (c), (d)) workloads with varying cluster or outlier evolution frequency. EDS is divided into five main stages, with the cluster evolution frequency increasing from one stage to the next. ODS is divided into three main stages, with the outlier evolution frequency increasing from one stage to the next.
  • Figure 4: Performance Comparison on Dim workload with varying dimensionality.
  • Figure 5: Detailed performance analysis of MOStream's variants on KDD99. Measurements are taken at intervals of 350,000 data points with 4 phases in total.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 1