In-memory Incremental Maintenance of Provenance Sketches [extended version]
Pengyuan Li, Boris Glavic, Dieter Gawlick, Vasudha Krishnaswamy, Zhen Hua Liu, Danica Porobic, Xing Niu
TL;DR
This work introduces IMP, an in-memory incremental engine for maintaining provenance sketches, which over-approximate query provenance to enable data skipping. IMP models data as sketch-annotated tuples, derives incremental maintenance rules for relational operators, and supports both eager and lazy maintenance, with optimizations such as delta filtering and Bloom filters. The approach is proven correct within a formal framework and validated experimentally, showing orders-of-magnitude improvements over full maintenance across mixed workloads, TPC-H, and real datasets. The results demonstrate that provenance sketches can be kept up to date efficiently, enabling scalable provenance-based data skipping in dynamic environments with updates.
Abstract
Provenance-based data skipping compactly over-approximates the provenance of a query using so-called provenance sketches and utilizes such sketches to speed up the execution of subsequent queries by skipping irrelevant data. However, a sketch captured at some time in the past may become stale if the data has been updated subsequently. Thus, there is a need to maintain provenance sketches. In this work, we introduce In-Memory incremental Maintenance of Provenance sketches (IMP), a framework for maintaining sketches incrementally under updates. At the core of IMP is an incremental query engine for data annotated with sketches that exploits the coarse-grained nature of sketches to enable novel optimizations. We experimentally demonstrate that IMP significantly reduces the cost of sketch maintenance, thereby enabling the use of provenance sketches for a broad range of workloads that involve updates.
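To make the idea of a stale sketch and its incremental repair concrete, the following toy Python sketch may help. It is not IMP's algorithm: it assumes, purely for illustration, that a sketch is the set of fragments of a range partition that contain at least one provenance tuple, that the query is a single selection, and that keeping one counter per fragment is enough to process inserts and deletes. The names `BOUNDARIES`, `fragment_of`, `in_provenance`, and `SketchMaintainer` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Assumed range partition on attribute `price`:
# fragment 0 = [0,10), 1 = [10,20), 2 = [20,30), 3 = [30, inf)
BOUNDARIES = [10, 20, 30]


def fragment_of(price):
    """Return the index of the fragment that contains `price`."""
    return bisect_right(BOUNDARIES, price)


def in_provenance(row):
    """Toy query: SELECT * FROM sales WHERE qty > 5."""
    return row["qty"] > 5


class SketchMaintainer:
    """Keeps a per-fragment count of provenance tuples (illustrative only)."""

    def __init__(self, rows):
        self.counts = Counter()
        self.apply_delta(inserted=rows, deleted=[])

    def apply_delta(self, inserted, deleted):
        # Only the delta is processed; the base data is never rescanned.
        for row in inserted:
            if in_provenance(row):
                self.counts[fragment_of(row["price"])] += 1
        for row in deleted:
            if in_provenance(row):
                frag = fragment_of(row["price"])
                self.counts[frag] -= 1
                if self.counts[frag] == 0:
                    del self.counts[frag]  # fragment leaves the sketch

    def sketch(self):
        """Fragments that a later query must still read."""
        return set(self.counts)


rows = [
    {"price": 5, "qty": 9},
    {"price": 12, "qty": 1},
    {"price": 35, "qty": 7},
]
m = SketchMaintainer(rows)
print(sorted(m.sketch()))  # [0, 3]

# One insert into fragment 1, one delete from fragment 0.
m.apply_delta(inserted=[{"price": 15, "qty": 8}],
              deleted=[{"price": 5, "qty": 9}])
print(sorted(m.sketch()))  # [1, 3]
```

Without the second `apply_delta` call, the sketch would still list fragment 0 and omit fragment 1, so a query that skips data based on it would miss the newly inserted row. This is the staleness problem that motivates maintenance.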
