QPOPSS: Query and Parallelism Optimized Space-Saving for Finding Frequent Stream Elements

Victor Jarlow; Charalampos Stylianopoulos; Marina Papatriantafilou

TL;DR

This work proposes Query and Parallelism Optimized Space-Saving (QPOPSS), a frequent-elements algorithm that provides concurrency guarantees. In an empirical evaluation against representative state-of-the-art methods, QPOPSS's multi-threaded throughput scales linearly while it maintains the highest accuracy, with an orders-of-magnitude smaller memory footprint.

Abstract

The frequent elements problem, a key component in demanding stream-data analytics, involves selecting elements whose occurrence exceeds a user-specified threshold. Fast, memory-efficient $ε$-approximate synopsis algorithms select all frequent elements but may overestimate them depending on $ε$ (user-defined parameter). Evolving applications demand performance only achievable by parallelization. However, algorithmic guarantees concerning concurrent updates and queries have been overlooked. We propose Query and Parallelism Optimized Space-Saving (QPOPSS), providing concurrency guarantees. The design includes an implementation of the Space-Saving algorithm supporting fast queries, implying minimal overlap with concurrent updates. QPOPSS integrates this with the distribution of work and fine-grained synchronization among threads, swiftly balancing high throughput, high accuracy, and low memory consumption. Our analysis, under various concurrency and data distribution conditions, shows space and approximation bounds. Our empirical evaluation relative to representative state-of-the-art methods reveals that QPOPSS's multi-threaded throughput scales linearly while maintaining the highest accuracy, with orders of magnitude smaller memory footprint.
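To make the problem statement concrete, the following is a minimal sketch of the classic sequential Space-Saving algorithm that the abstract builds on. It is not QPOPSS or its query-optimized variant: there is no concurrency, and the plain dictionary with a linear scan for the minimum counter is an assumption made for brevity. The class name `SpaceSaving`, its attributes, and the threshold parameter `phi` in `query` are illustrative names, not taken from the paper's implementation.

```python
import math
from typing import Dict, Hashable


class SpaceSaving:
    """Textbook sequential Space-Saving with m = ceil(1/epsilon) counters.

    Illustrative sketch only: eviction scans all counters (O(m)), whereas
    practical implementations keep the minimum accessible in O(1) or O(log m).
    """

    def __init__(self, epsilon: float) -> None:
        self.m = math.ceil(1 / epsilon)          # number of monitored elements
        self.n = 0                               # stream length seen so far
        self.counts: Dict[Hashable, int] = {}    # estimated counts (never underestimate)
        self.errors: Dict[Hashable, int] = {}    # maximum overestimation per element

    def update(self, x: Hashable) -> None:
        self.n += 1
        if x in self.counts:
            self.counts[x] += 1
        elif len(self.counts) < self.m:
            self.counts[x] = 1
            self.errors[x] = 0
        else:
            # Evict the element with the smallest count and let x inherit it.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[x] = inherited + 1
            self.errors[x] = inherited

    def query(self, phi: float) -> Dict[Hashable, int]:
        """Return monitored elements whose estimated count exceeds phi * n."""
        threshold = phi * self.n
        return {x: c for x, c in self.counts.items() if c > threshold}


if __name__ == "__main__":
    ss = SpaceSaving(epsilon=0.25)
    for item in ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]:
        ss.update(item)
    print(ss.query(phi=0.3))
```

Because counts are only ever overestimated, every element whose true frequency exceeds the threshold is reported, while some reported elements may be false positives, which matches the behaviour the abstract attributes to $ε$-approximate synopses.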

Paper Structure (25 sections, 12 theorems, 20 equations, 10 figures, 3 tables, 4 algorithms)

Introduction
Preliminaries
Problem analysis
Global Data Structure
Thread-Local Data Structures
Accuracy and Consistency
Need for a Balancing Approach
Query and Parallelism Optimized Space-Saving
Design Overview
Auxiliary Concepts
Query Optimized Space-Saving
Concurrent Updates
Concurrent Frequent Elements Queries
Analysis
Domain Splitting and Space Requirements
...and 10 more sections

Key Result

Lemma 1

QOSS preserves the following properties (implied by the respective lemmas and theorems in Metwally et al., 2005)

Figures (10)

Figure 1: A binary min-max tree with alternating levels. The dashed arrows depict the traversal order during a QOSS query.
Figure 2: Overview of the update and query operations. Thread $t_1$ transfers full filters to the owner-threads for subsequent insertion into the reserved thread-local QOSS data structures. Queries are mutually exclusive with insertions and gather the subset of frequent elements tracked by each thread into $F$.
Figure 3: Rank and count of each unique element in the CAIDA data set. Zipf distributions with skew 0.5 and 1 are plotted as a guide. Note the logarithmic scale on x- and y-axes.
Figure 4: Throughput and query latency when QPOPSS employs QOSS or Space-Saving as the inner algorithm. Queries make up 0.01% of the operations and $\phi=10^{-4}$.
Figure 5: Throughput in million operations per second and multicore speedup of QPOPSS for different skew parameters of the synthetic Zipf data sets. The skew level varies along the x-axis, $T=24$ threads.
...and 5 more figures

Theorems & Definitions (21)

Definition 1
Definition 2
Lemma 1
Lemma 2
proof
Lemma 3
proof
Corollary 1
Theorem 1
proof
...and 11 more
