A Comprehensive Scalable Framework for Cloud-Native Pattern Detection with Enhanced Expressiveness

Ioannis Mavroudopoulos, Anastasios Gounaris

TL;DR

This paper tackles scalable detection of arbitrary complex patterns in massive log datasets by decoupling storage from query processing and embedding a CEP engine (SASE) within a Spark-based query processor. The authors extend SIESTA, an indexing-based framework that uses inverted indices and event-pairs to answer pattern queries with gap/time constraints and CEP operators efficiently, while also explaining why a pattern did not match. Key contributions include an extended query processor design, a comparison of two storage back ends (Cassandra and OSS with Parquet), incremental indexing with lookback partitioning, and integration of SASE for final validation and explainability. Evaluations against ELK, FlinkCEP, MATCH_RECOGNIZE (MR), Signatures, and Set-Containment show better scalability for large patterns, faster responses on complex queries, and favorable cost-performance trade-offs with OSS. The work provides a practical, cloud-native approach to scalable, expressive pattern detection in real-world log analytics scenarios, with open-source implementations and reproducible experiments.

Abstract

Detecting complex patterns in large volumes of event logs has applications in many domains, such as business processes and fraud detection. Existing systems like ELK are commonly used to tackle this challenge, but their performance deteriorates for large patterns, and they are limited both in expressiveness and in their ability to explain their responses. In this work, we propose a solution that integrates a Complex Event Processing (CEP) engine into a broader query processor on top of a decoupled storage infrastructure containing inverted indices of log events. The results demonstrate that our system excels in scalability and robustness, particularly in handling complex queries. Notably, our proposed system delivers responses for large complex patterns within seconds, while ELK experiences timeouts after 10 minutes. It also significantly outperforms solutions relying on FlinkCEP and executing MATCH_RECOGNIZE SQL queries.
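
The index-then-validate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the in-memory dictionary standing in for the inverted index, and the plain subsequence check standing in for the CEP engine are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_pair_index(traces):
    """Map each ordered event-type pair (a, b) to the ids of traces in which
    some a occurs before some b. Hypothetical stand-in for the stored index."""
    index = defaultdict(set)
    for trace_id, events in traces.items():
        for i, j in combinations(range(len(events)), 2):
            index[(events[i], events[j])].add(trace_id)
    return index


def is_subsequence(pattern, events):
    """Validation step (stand-in for the CEP engine): the pattern's event
    types must appear in order, with arbitrary gaps allowed."""
    remaining = iter(events)
    return all(p in remaining for p in pattern)


def detect(pattern, traces, index):
    """Prune candidates with the pair index, then validate the survivors."""
    candidates = set(traces)
    for pair in zip(pattern, pattern[1:]):
        candidates &= index.get(pair, set())
    return sorted(t for t in candidates if is_subsequence(pattern, traces[t]))


# Toy log: trace id -> ordered list of event types (made-up data).
traces = {
    "t1": ["a", "b", "c", "d"],
    "t2": ["a", "c", "b"],
    "t3": ["b", "a", "c", "b", "c"],
    "t4": ["b", "c", "a", "b"],
}
index = build_pair_index(traces)
matches = detect(["a", "b", "c"], traces, index)
print(matches)
```

In this toy log, trace `t4` contains both pairs `(a, b)` and `(b, c)` but not the full pattern, which is why a final validation pass over the pruned candidates is still needed.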

Paper Structure

This paper contains 17 sections, 2 theorems, 10 figures, 3 tables, 2 algorithms.

Key Result

Lemma 4.1

If a trace $t$ contains multiple events of type $a_i$, then including the et-pair $(a_i, a_i)$ in the list of et-pairs to be queried from the IndexTable guarantees the retrieval of all events of that type.
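
The intuition behind the lemma can be sketched on a toy trace. The sketch assumes that same-type pairs are formed from consecutive occurrences of the type, so that each pair ends where the next one starts; this pairing rule and the helper names are assumptions for illustration, not the paper's definitions.

```python
def same_type_pairs(trace, event_type):
    """Pair consecutive occurrences of event_type by position:
    (p1, p2), (p2, p3), ... (assumed pairing rule)."""
    positions = [i for i, e in enumerate(trace) if e == event_type]
    return list(zip(positions, positions[1:]))


def positions_from_pairs(pairs):
    """Collect every distinct position that appears in any pair."""
    return sorted({p for pair in pairs for p in pair})


# Toy trace with three events of type "a" (made-up data).
trace = ["a", "b", "a", "c", "a"]
pairs = same_type_pairs(trace, "a")
recovered = positions_from_pairs(pairs)
print(pairs, recovered)
```

Under this assumption, every occurrence of the type appears in at least one stored pair whenever the trace contains two or more such events, so querying the same-type pair recovers all of them.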

Figures (10)

  • Figure 1: Architecture of the initial SIESTA proposal (white background) along with its extensions in this work (light blue background)
  • Figure 2: Example of indices built
  • Figure 3: Example with modified timestamps
  • Figure 4: Indexing times for the different datasets over a 10-day period.
  • Figure 5: Indexing times for different datasets and various systems.
  • ...and 5 more figures

Theorems & Definitions (7)

  • Definition 3.1
  • Definition 3.2: et-pair
  • Definition 3.3: event-pair
  • Definition 3.4: event-pairs non-overlapping in time
  • Definition 3.5: Occurrences, non-overlapping in time
  • Lemma 4.1
  • Theorem 4.2: Correctness of our proposal