RPS: A Generic Reservoir Patterns Sampler

Lamine Diop, Marc Plantevit, Arnaud Soulet

TL;DR

This study introduces an approach that uses a weighted reservoir to sample patterns directly from streaming batch data, ensuring scalability and efficiency. It also presents a generic algorithm that addresses temporal biases and handles various pattern types, including sequential, weighted, and unweighted itemsets.

Abstract

Efficient learning from streaming data is important for modern data analysis due to the continuous and rapid evolution of data streams. Despite significant advancements in stream pattern mining, challenges persist, particularly in managing complex data streams like sequential and weighted itemsets. While reservoir sampling serves as a fundamental method for randomly selecting fixed-size samples from data streams, its application to such complex patterns remains largely unexplored. In this study, we introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data, thus ensuring scalability and efficiency. We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets. Through comprehensive experiments conducted on real-world datasets, we evaluate the effectiveness of our method, showcasing its ability to construct accurate incremental online classifiers for sequential data. Our approach not only enables previously unusable online machine learning models for sequential data to achieve accuracy comparable to offline baselines but also represents significant progress in the development of incremental online sequential itemset classifiers.
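
For readers unfamiliar with the building block named in the abstract, the sketch below shows one classic form of weighted reservoir sampling: the key-based scheme usually attributed to Efraimidis and Spirakis. It is not the RPS algorithm itself, which is not reproduced on this page. The function name `weighted_reservoir_sample`, the `(item, weight)` stream format, and the choice of this particular scheme are assumptions made for illustration only.

```python
import heapq
import random


def weighted_reservoir_sample(stream, k, rng=None):
    """Keep k items from a stream of (item, weight) pairs.

    Illustrative key-based scheme (assumed here, not taken from the paper):
    each item gets the key u ** (1 / weight) with u uniform in [0, 1),
    and the k items with the largest keys are retained.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival_index, item); smallest key on top
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # items without positive weight can never be sampled
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, item)  # index breaks ties so items are never compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)  # evict the current smallest key
    return [item for _, _, item in heap]


if __name__ == "__main__":
    # Hypothetical stream: item i arrives with weight i + 1.
    stream = ((f"item{i}", float(i + 1)) for i in range(1000))
    print(weighted_reservoir_sample(stream, k=5, rng=random.Random(42)))
```

The reservoir never holds more than `k` entries, so memory stays fixed however long the stream runs, which is the property that makes reservoir methods attractive for streaming data.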

Paper Structure

This paper contains 23 sections, 13 equations, 4 figures, 6 tables, 2 algorithms.

Figures (4)

  • Figure 1: Overview of the approach (the incomplete block denotes next batch)
  • Figure 2: RPS-based classifier framework
  • Figure 3: Evolution of the accuracy per batch with different parameters on Books. Learning timestamps are in red. $k$: reservoir size, $N$: batch size, $ld$: learning duration, $pd$: predict duration
  • Figure 4: Comparison between RPS-based classifiers (with reservoir size $k$ = 10,000; batch size = 1,000; learning duration = 2 time units; predict duration = 52 time units) vs cheater classifiers (with 50% train and 50% test)

Theorems & Definitions (18)

  • Definition 1: Pattern
  • Definition 2: Frequency
  • Definition 3: Norm-based utility [diop2019kais]
  • Definition 4: Norm-based utility measure
  • Definition 5: Global Pattern Utility
  • Definition 6: Damping function $\nabla_{\varepsilon}(t_n, t_j)$
  • Definition 7: Pattern Global Utility under temporal bias
  • Example 1
  • Proof
  • Definition 8: Cumulative Binomial Probability Distribution (CBPD)
  • ...and 8 more