Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams

Matthew Andres Moreno; Luis Zaman; Emily Dolson

TL;DR

This work addresses the problem of extracting a fixed-capacity, rolling subsample from a data stream, and explores ``data stream curation'' strategies to fulfill requirements on the composition of sample time points retained.

Abstract

Operations over data streams typically hinge on efficient mechanisms to aggregate or summarize history on a rolling basis. For high-volume data streams, it is critical to manage state in a manner that is fast and memory efficient -- particularly in resource-constrained or real-time contexts. Here, we address the problem of extracting a fixed-capacity, rolling subsample from a data stream. Specifically, we explore ``data stream curation'' strategies to fulfill requirements on the composition of sample time points retained. Our ``DStream'' suite of algorithms targets three temporal coverage criteria: (1) steady coverage, where retained samples should spread evenly across elapsed data stream history; (2) stretched coverage, where early data items should be proportionally favored; and (3) tilted coverage, where recent data items should be proportionally favored. For each algorithm, we prove worst-case bounds on rolling coverage quality. We focus on the more practical, application-driven case of maximizing coverage quality given a fixed memory capacity. As a core simplifying assumption, we restrict algorithm design to a single update operation: writing from the data stream to a calculated buffer site -- with data never being read back, no metadata stored (e.g., sample timestamps), and data eviction occurring only implicitly via overwrite. Drawing only on primitive, low-level operations and ensuring full, overhead-free use of available memory, this ``DStream'' framework ideally suits domains that are resource-constrained, performance-critical, and fine-grained (e.g., individual data items as small as single bits or bytes). The proposed approach supports $\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further practical applications, we provide plug-and-play open-source implementations targeting both scripted and compiled application domains.
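The single-update-operation constraint described above can be illustrated with a minimal Python sketch. It is not one of the DStream algorithms: the site selection policy here is a plain ring buffer (`T % S`), swapped in only because its behavior is easy to verify by hand. The names `ring_site`, `ring_lookup`, and `ingest` are hypothetical, and the sketch assumes that `T` counts items ingested so far, with ingest times numbered from 0. The point is that each item is written once to a calculated site, no timestamp is stored, and the ingest time of each stored item can still be recomputed from `S` and `T` alone.

```python
from typing import Any, List, Optional


def ring_site(S: int, T: int) -> Optional[int]:
    """Placeholder policy (NOT a DStream algorithm): buffer site for the
    item ingested at time T. Returning None would mean "discard"."""
    return T % S


def ring_lookup(S: int, T: int) -> List[Optional[int]]:
    """Ingest time of the item held at each site after T items (times
    0..T-1) have been ingested; None where a site was never written."""
    times: List[Optional[int]] = []
    for k in range(S):
        if T <= k:
            times.append(None)  # site k has not been reached yet
        else:
            last = T - 1  # most recent ingest time
            # Step back from `last` to the nearest time congruent to k mod S.
            times.append(last - ((last - k) % S))
    return times


def ingest(buffer: List[Any], T: int, item: Any, select_site=ring_site) -> None:
    """The only update operation: write the item to its calculated site.
    Nothing is read back, and eviction happens implicitly by overwrite."""
    k = select_site(len(buffer), T)
    if k is not None:
        buffer[k] = item


buffer: List[Any] = [None] * 4
for T in range(6):
    ingest(buffer, T, f"item{T}")
ingest_times = ring_lookup(4, 6)
```

A ring buffer retains only the most recent `S` items, so it satisfies none of the three coverage criteria; a curation policy would replace `ring_site` and `ring_lookup` with a matched pair that yields the desired spread of retained time points.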

Paper Structure (59 sections, 20 theorems, 105 equations, 9 figures, 1 table, 6 algorithms)

Introduction
Stream Curation Problem
Applications of Stream Curation
Prior Work
Proposed Approach
Major Results
Preliminaries, Notations, and Terminology
Buffer Storage $\mathcolor{blue}{S}$
Logical Time $\mathcolor{red}{T}$ and Item Ingest Time $\mathcolor{red}{{\overset{ }{T}}}$
Gap Size $\mathcolor{teal}{g}$
Time Hanoi Value $\mathcolor{violet}{h}$
Time Epoch $\mathcolor{VioletRed}{\mathrm{t}}$
Site Reservations $\mathcolor{violet}{\mathcal{H}}_{\mathcolor{VioletRed}{\mathrm{t}}}(\mathcolor{purple}{k})$
Time Meta-epoch $\mathcolor{orange}{\tau}$
Restrictions on Logical Time $\mathcolor{red}{T}$, Epoch $\mathcolor{VioletRed}{\mathrm{t}}$, and Meta-epoch $\mathcolor{orange}{\tau}$
...and 44 more sections

Key Result

Theorem 4.1

Under the steady curation algorithm, where $\hat{\mathcolor{teal}{g}}$ is the optimal lower bound on $\mathsf{cost\_steady}(\mathcolor{red}{T})$ given in Equation eqn:steady-optimal-gap-size.
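As a rough illustration of the quantity this theorem bounds, the sketch below measures the largest absolute gap in a retained collection, following the description of the steady criterion in the Figure 1 caption. The function name `max_gap` and the counting convention are assumptions made for this example: ingest times run from 0 to `T - 1`, and a gap is the number of discarded time points between neighboring retained items, including any run before the first or after the last retained item. The paper's exact definition of $\mathsf{cost\_steady}$ may differ in such details.

```python
from typing import Iterable


def max_gap(retained: Iterable[int], T: int) -> int:
    """Largest run of consecutive discarded time points among ingest
    times 0..T-1, given the ingest times of the retained items."""
    largest = 0
    prev = -1  # sentinel just before the first possible ingest time
    for t in sorted(retained):
        largest = max(largest, t - prev - 1)
        prev = t
    # Account for the run between the last retained item and time T.
    return max(largest, T - prev - 1)


even_spread = max_gap([0, 3, 6, 9], 10)
front_loaded = max_gap([0, 1, 2, 9], 10)
```

With four retained items out of ten, the evenly spread collection has a largest gap of 2, while the front-loaded one has a largest gap of 6, which is why the steady criterion favors the former.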

Figures (9)

Figure 1: Surveyed target coverage criteria. Ideal distributions of ingestion time points for retained data items under each criterion are shown at $\mathcolor{red}{T}=50$ (bottom) and $\mathcolor{red}{T}=100$ (top). Each vertical bar represents a retained data item. In this illustration, collection size is 12 retained items. All other ingested data items have been discarded. The steady criterion (\ref{fig:criteria-intuition-steady}) seeks to minimize the largest absolute gap size, so ideal retention maintains items spread evenly across data stream history. The stretched criterion (\ref{fig:criteria-intuition-stretched}) calls for greater retention of early data items to minimize gap size proportional to data item ingestion time $\mathcolor{red}{{\overset{ }{T}}}$. In contrast, under the tilted criterion (\ref{fig:criteria-intuition-tilted}), recency-proportional gap size is to be minimized, necessitating over-retention of recent data items.
Figure 2: Core stream curation algorithm operations. The ingest site selection operation (operation shown as item "a") takes the current time $\mathcolor{red}{T}$ and determines the buffer site $\mathcolor{purple}{k}$ to store the ingested data item. Data items may also be discarded without storage, as are $\mathcolor{red}{{\overset{ }{T}}}=4$ and $\mathcolor{red}{{\overset{ }{T}}}=6$ in this example. This operation is performed when storing data into a curated buffer, once for each data item received from the data stream. Data is not moved after it is stored. The ingested time calculation operation (operation shown as item "b") provides the previous time $\mathcolor{red}{{\overset{ }{T}}}$ when the data item present at buffer site $\mathcolor{purple}{k}$ was ingested, given the current time $\mathcolor{red}{T}$. This operation is performed when reading data from a curated buffer in order to identify the provenance of stored data. Note that which data item $\mathcolor{red}{{\overset{ }{T}}}$ occupies a buffer site $\mathcolor{purple}{k}$ at time $\mathcolor{red}{T}$ results solely from the sequence of ingest storage sites selected up to that point. As such, the site lookup operation $\mathcolor{purple}{\mathrm{L}}$ can be considered, in a loose sense, as "decoding" or "inverse" to the site selection operation $\mathcolor{purple}{\mathrm{K}}$. Panels with diamond markers on the right show curated collection composition at $\mathcolor{red}{T}=4$ and $\mathcolor{red}{T}=8$. Figure \ref{fig:criteria-intuition} shows the target curated collection compositions considered in this work.
Figure 3: Hanoi value retention strategies. Data item retention can be prioritized based on "hanoi value" of ingestion time $\mathcolor{red}{T}$. Here, "lollipop" bars show data item hanoi values, $\mathcolor{violet}{\mathrm{H}}(\mathcolor{red}{{\overset{ }{T}}})$. To satisfy the steady criterion, our proposed strategy discards data items with h.v. below a threshold $n(\mathcolor{red}{T})$ (\ref{fig:hanoi-intuition-steady}). Red arrows show the threshold $n$ increasing as time elapses, purging low h.v. data items to respect available buffer space. Our strategy for the stretched criterion retains the first $n'(\mathcolor{red}{T})$ data item instances of all observed h.v.'s (\ref{fig:hanoi-intuition-stretched}). As time elapses, $n'(\mathcolor{red}{T})$ is halved across h.v.'s in a rolling fashion --- also shown by red arrows above. Our strategy to satisfy the tilted criterion operates similarly to the stretched strategy, except the last $n'(\mathcolor{red}{T})$ data item instances of each h.v. are retained (\ref{fig:hanoi-intuition-tilted}). The bottom and top panels compare example retention at $\mathcolor{red}{T}=50$ and $\mathcolor{red}{T}=100$, respectively. Green boxes indicate retained data items.
Figure 4: Steady algorithm strategy. Top panel \ref{fig:hsurf-steady-intuition-diagram} shows sites selected for items with h.v. $\mathcolor{violet}{h}=6$ from their first occurrence during epoch $\mathcolor{VioletRed}{\mathrm{t}}=2$ to epoch $\mathcolor{VioletRed}{\mathrm{t}}=7$, when stored instances of that h.v. are overwritten. Memory buffer sites are shown across the bottom of the schematic. Data items' vertical span stretches across time from the epoch when they are stored to the epoch when they are overwritten. The first data item with hanoi value $\mathcolor{violet}{\mathrm{H}}(\mathcolor{red}{T}) = \mathcolor{violet}{h}$ is placed in bunch 0 during epoch $\mathcolor{VioletRed}{\mathrm{t}}=\mathcolor{violet}{h}-4$. The next data item with h.v. $\mathcolor{violet}{h}$ is encountered in the following epoch, and it is placed in bunch 1. In epoch $\mathcolor{VioletRed}{\mathrm{t}}=\mathcolor{violet}{h}-2$, two data items with h.v. $\mathcolor{violet}{h}$ are encountered and placed into segments within bunch 2. Epoch $\mathcolor{VioletRed}{\mathrm{t}}=\mathcolor{violet}{h}-1$ encounters four data items with h.v. $\mathcolor{violet}{h}$ and places them in bunch 3's segments. In epoch $\mathcolor{VioletRed}{\mathrm{t}}=\mathcolor{violet}{h}$, eight h.v. $\mathcolor{violet}{h}$ data items (twice as many) are encountered. We place them in bunch 4's one-site segments. Finally, during epoch $\mathcolor{VioletRed}{\mathrm{t}}=\mathcolor{violet}{h}+1$, all further ingested data items with h.v. $\mathcolor{violet}{h}$ are discarded and all existing stored h.v. $\mathcolor{violet}{h}$ items are overwritten. In this manner, data items with highest h.v. are retained on a rolling basis to provide uniformly-spaced gaps --- as laid out in Figure \ref{fig:hanoi-intuition-steady}. Bottom panel \ref{fig:hsurf-steady-intuition-heatmap} shows h.v. site reservations $\mathcolor{violet}{\mathcal{H}}_{\mathcolor{VioletRed}{\mathrm{t}}}(\mathcolor{purple}{k})$ from epoch $\mathcolor{VioletRed}{\mathrm{t}}=0$ through $\mathcolor{VioletRed}{\mathrm{t}}=5$ with buffer size $\mathcolor{blue}{S}=16$. Numbering/color coding corresponds to which h.v. a site is reserved for. Black dividers separate bunches; white space divides segments within bunches. Annotations highlight the lifecycle of data items with h.v. $\mathcolor{violet}{h}=6$.
Figure 5: Steady algorithm implementation. Top panel \ref{fig:hsurf-steady-implementation-site-selection} enumerates initial steady policy site selection on a 32-site buffer. Panel \ref{fig:hsurf-steady-implementation-schematic} summarizes how data items are ingested and retained over time within a 32-site buffer, color-coded by data items' hanoi values $\mathcolor{violet}{\mathrm{H}}(\mathcolor{red}{T})$. Between $\mathcolor{red}{T}=0$ and $\mathcolor{red}{T}=126$, time is segmented into epochs $\mathcolor{VioletRed}{\mathrm{t}}=0$, $\mathcolor{VioletRed}{\mathrm{t}}=1$, and $\mathcolor{VioletRed}{\mathrm{t}}=2$; strips before each epoch show hanoi values assigned to each buffer site during that epoch. Time increases along the $y$ axis. Rectangles with a small white "$\blkhorzoval$" symbol denote the buffer site where the ingested data item from each timestep $\mathcolor{red}{T}$ is placed. Buffer space is split into "reservation segments." Reservation segments occur in five "bunches" --- (1) one 6-site segment, (2) one 4-site segment, (3) two 3-site segments, (4) four 2-site segments, and (5) eight 1-site segments. At each epoch, data items are filled into sites newly assigned for their ingestion-order hanoi value from left to right. In epoch $\mathcolor{VioletRed}{\mathrm{t}}=0$, all sites are filled with a first data item. During each subsequent epoch $\mathcolor{VioletRed}{\mathrm{t}}>0$, segments within bunch $i$ each accept one data item with h.v. $\mathcolor{violet}{h}=\mathcolor{VioletRed}{\mathrm{t}} + \mathcolor{blue}{\hat{\mathrm{s}}} - 1 - i$. All newly-assigned sites were previously assigned to the overall now-lowest hanoi value $\mathcolor{violet}{h}=\mathcolor{VioletRed}{\mathrm{t}} - 1$. In this way, all instances of the overall lowest hanoi value are overwritten each epoch. Heatmap panel \ref{fig:hsurf-steady-implementation-heatmap} shows the evolution of data item age at each site on a 256-bit field over the course of 4,096 time steps. Dripplot panel \ref{fig:hsurf-steady-implementation-dripplot} shows retention spans for 3,000 ingested time points. Vertical lines span durations between ingestion and elimination for data items from successive time points. Time points previously eliminated are marked in red. Lineplot panel \ref{fig:hsurf-steady-implementation-satisfaction} shows steady criterion satisfaction on a 16-bit surface over $2^{16}$ timepoints. Lower and upper shaded areas are best- and worst-case bounds, respectively.
...and 4 more figures
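The hanoi value thresholding idea from the Figure 3 caption can be sketched in a few lines of Python. Two assumptions are made here and are not confirmed by the text above: the hanoi value of time $T$ is taken to be the number of trailing zero bits of $T + 1$ (giving the sequence 0, 1, 0, 2, 0, 1, 0, 3, ...), and ingest times are numbered from 0. The function names are hypothetical, and the retention function recomputes the retained set from scratch, so it illustrates only the target composition, not the paper's in-place site selection.

```python
from typing import List


def hanoi_value(T: int) -> int:
    """Assumed definition: count of trailing zero bits of T + 1."""
    n = T + 1
    # n & -n isolates the lowest set bit; its bit position is the count.
    return (n & -n).bit_length() - 1


def steady_threshold(S: int, T: int) -> int:
    """Smallest threshold n such that at most S of the ingest times
    0..T-1 have hanoi value >= n (exactly T >> n times qualify)."""
    n = 0
    while (T >> n) > S:
        n += 1
    return n


def naive_steady_retained(S: int, T: int) -> List[int]:
    """Ingest times kept when items below the threshold are discarded."""
    n = steady_threshold(S, T)
    return [t for t in range(T) if hanoi_value(t) >= n]


first_values = [hanoi_value(t) for t in range(8)]
kept_50 = naive_steady_retained(12, 50)
kept_100 = naive_steady_retained(12, 100)
```

With capacity 12, the threshold is 2 at $T=50$ and 3 at $T=100$, so the retained items sit at every fourth and every eighth time point respectively. Raising the threshold by one roughly halves the retained set while keeping it evenly spaced, which matches the purging behavior the caption describes.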

Theorems & Definitions (42)

Theorem 4.1: Steady algorithm gap size upper bound
proof
Lemma 5.1: Meta-epochs $\mathcolor{orange}{\tau}$ correspond to segment subsumption cycles
proof
Theorem 5.1: Stretched algorithm gap size ratio upper bound
proof
Theorem 6.1: Tilted algorithm gap size ratio upper bound
proof
Lemma S3.1: Current meta-epoch upper bounds
proof
...and 32 more
