In-memory Incremental Maintenance of Provenance Sketches [extended version]

Pengyuan Li, Boris Glavic, Dieter Gawlick, Vasudha Krishnaswamy, Zhen Hua Liu, Danica Porobic, Xing Niu

TL;DR

This work introduces IMP, an in‑memory incremental engine for maintaining provenance sketches that over‑approximate query provenance to enable data skipping. IMP models data as sketch‑annotated tuples, derives incremental rules for relational operators, and supports eager and lazy maintenance with optimizations like delta filtering and bloom filters. The approach is proven correct through a formal framework and validated experimentally, showing orders‑of‑magnitude improvements over full maintenance across mixed workloads, TPC‑H, and real datasets. The results demonstrate that provenance sketches can be kept up‑to‑date efficiently, enabling scalable provenance‑based data skipping in dynamic environments with updates.

Abstract

Provenance-based data skipping compactly over-approximates the provenance of a query using so-called provenance sketches and utilizes such sketches to speed-up the execution of subsequent queries by skipping irrelevant data. However, a sketch captured at some time in the past may become stale if the data has been updated subsequently. Thus, there is a need to maintain provenance sketches. In this work, we introduce In-Memory incremental Maintenance of Provenance sketches (IMP), a framework for maintaining sketches incrementally under updates. At the core of IMP is an incremental query engine for data annotated with sketches that exploits the coarse-grained nature of sketches to enable novel optimizations. We experimentally demonstrate that IMP significantly reduces the cost of sketch maintenance, thereby enabling the use of provenance sketches for a broad range of workloads that involve updates.
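The idea of a provenance sketch and how it goes stale can be illustrated with a small, hypothetical example. The sketch below is not IMP's implementation: the range boundaries `BOUNDS`, the attribute names `price` and `qty`, and the toy selection query are all assumptions made for illustration. It records which ranges of a partition contain provenance, uses that set to skip data, and shows that an insert into an uncovered range makes the old sketch unsafe.

```python
from bisect import bisect_right

# Assumed range partition on "price": [..,100), [100,200), [200,300), [300,..)
BOUNDS = [100, 200, 300]


def fragment(value):
    """Index of the range (fragment) that contains value."""
    return bisect_right(BOUNDS, value)


def query(rows):
    """Toy selection query (an assumption): rows with qty > 5."""
    return [r for r in rows if r["qty"] > 5]


def capture_sketch(rows):
    """Sketch = fragments holding at least one tuple of the query's provenance."""
    return {fragment(r["price"]) for r in query(rows)}


def query_with_skipping(rows, sketch):
    """Evaluate the query only over fragments listed in the sketch."""
    return query([r for r in rows if fragment(r["price"]) in sketch])


rows = [
    {"price": 50, "qty": 9},
    {"price": 120, "qty": 1},
    {"price": 250, "qty": 7},
    {"price": 260, "qty": 2},
]

sketch = capture_sketch(rows)  # fragments 0 and 2
# Skipping fragments 1 and 3 does not change the answer.
same_answer = query_with_skipping(rows, sketch) == query(rows)

# An insert into fragment 1 that satisfies the query makes the sketch stale:
# the old sketch would skip the new tuple and lose a result row.
rows_after = rows + [{"price": 150, "qty": 8}]
stale_result = query_with_skipping(rows_after, sketch)
fresh_result = query(rows_after)
```

Here `stale_result` has one row fewer than `fresh_result`, which is the situation that sketch maintenance is meant to prevent.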

Paper Structure

This paper contains 72 sections, 12 theorems, 96 equations, 22 figures.

Key Result

Theorem 1

$\mathcal{I}$ as defined in sec:problem_definition is an incremental maintenance procedure such that it takes as input a state $\mathcal{S}$, the annotated delta $\Delta \mathscr{D}$, the ranges $\Phi$, and a query $Q$, and returns an updated state $\mathcal{S}'$
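The shape of the interface in Theorem 1 (state, annotated delta, ranges, and query in; updated state out) can be sketched for a very simple case. The following is a hypothetical illustration, not the paper's procedure $\mathcal{I}$: it assumes the query is a plain selection given as a `predicate`, that the state is a per-fragment count of provenance tuples, and that the delta is a list of `(+1 | -1, row)` pairs. None of these representations are taken from the paper.

```python
from bisect import bisect_right
from collections import Counter


def maintain(state, delta, ranges, predicate, attr):
    """Return an updated state without rescanning the base data.

    state     : Counter mapping fragment index -> number of provenance tuples
    delta     : list of (sign, row), sign is +1 for insert and -1 for delete
    ranges    : sorted range boundaries of the partition
    predicate : the selection condition standing in for the query
    attr      : name of the partitioning attribute
    """
    new_state = Counter(state)  # copy, so the input state is left untouched
    for sign, row in delta:
        if predicate(row):  # only tuples in the provenance affect the sketch
            new_state[bisect_right(ranges, row[attr])] += sign
    # Drop fragments whose last provenance tuple was deleted.
    return Counter({f: c for f, c in new_state.items() if c > 0})


def sketch_of(state):
    """The sketch is the set of fragments with a positive count."""
    return set(state)


ranges = [100, 200, 300]
state = Counter({0: 1, 2: 1})  # one provenance tuple in fragment 0, one in 2
delta = [
    (+1, {"price": 150, "qty": 8}),  # enters the provenance, fragment 1
    (-1, {"price": 250, "qty": 7}),  # leaves the provenance, fragment 2
    (+1, {"price": 120, "qty": 1}),  # fails the predicate, no effect
]
new_state = maintain(state, delta, ranges, lambda r: r["qty"] > 5, "price")
```

Keeping counts rather than a bare set of fragments is a design choice in this sketch: it lets a deletion remove a fragment from the sketch only when no other provenance tuple remains in it.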

Figures (22)

  • Figure 1: Example query and relevant subsets of the database.
  • Figure 2: IMP manages a set of sketches. For each incoming query, IMP determines whether to (i) capture a new sketch, (ii) use an existing non-stale sketch, or (iii) incrementally maintain a stale sketch and then utilize the updated sketch to answer the query.
  • Figure 3: Glossary
  • Figure 4: Bag Relational Algebra
  • Figure 5: Using our to evaluate a query under incremental annotated semantics.
  • ...and 17 more figures

Theorems & Definitions (25)

  • Example 1.1
  • Example 1.2: Stale Sketches
  • Definition 4.1: Range partition
  • Definition 4.2: Provenance Sketch
  • Example 4.1
  • Definition 4.3: Sketch Annotated Relation
  • Definition 4.4: Annotating Relations
  • Example 4.2
  • Definition 4.5: Incremental Maintenance Procedure
  • Example 5.1
  • ...and 15 more