Compression and In-Situ Query Processing for Fine-Grained Array Lineage
Jinjin Zhao, Sanjay Krishnan
TL;DR
DSLog addresses the challenge of storing fine-grained array lineage by introducing ProvRC, a compression algorithm that exploits spatial regularity via Multi-Attribute Range Encoding and Relative Value Transformation. It supports in-situ query processing directly over compressed lineage for forward and backward queries, avoiding decompression and delivering large performance gains. The approach uses a normalized relational model with index reshaping to enable cross-operation reuse, achieving substantial storage reductions (down to around $0.3\%$) and dramatic query-latency improvements (up to $20\times$ in some pipelines) on array workloads up to $10^6$ cells. Together, capture, compression, reuse, and in-situ querying enable practical, scalable fine-grained provenance for data science workflows, including integration with numpy and real-world data pipelines.
Abstract
Tracking data lineage is important for data integrity, reproducibility, and debugging data science workflows. However, fine-grained lineage (i.e., at a cell level) is challenging to store, even for the smallest datasets. This paper introduces DSLog, a storage system that efficiently stores, indexes, and queries array data lineage, agnostic to capture methodology. A main contribution is our new compression algorithm, named ProvRC, that compresses captured lineage relationships. Using ProvRC for lineage compression results in a significant storage reduction over functions with simple spatial regularity, beating alternative columnar-store baselines by up to 2000x. We also show that ProvRC facilitates in-situ query processing that allows forward and backward lineage queries without decompression; in the best case, it surpasses baselines by 20x in query latency on random numpy pipelines.
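
To make the idea of spatial regularity concrete, the following is a minimal sketch of range encoding over relative offsets for one-dimensional lineage, with forward and backward queries answered directly on the compressed form. It is an illustrative simplification and not the ProvRC algorithm itself: the names `compress`, `backward`, and `forward`, the `(out_lo, out_hi, delta)` run layout, and the restriction to 1-D indices are all assumptions made for this example.

```python
from typing import List, Tuple

Pair = Tuple[int, int]      # (output_index, input_index): one cell-level lineage edge
Run = Tuple[int, int, int]  # (out_lo, out_hi, delta), meaning input = output + delta


def compress(pairs: List[Pair]) -> List[Run]:
    """Collapse lineage pairs into runs of consecutive outputs sharing one offset.

    Hypothetical helper for illustration only; not the paper's ProvRC.
    """
    runs: List[Run] = []
    # Sort by offset, then by output index, so mergeable pairs are adjacent.
    for out, inp in sorted(pairs, key=lambda p: (p[1] - p[0], p[0])):
        delta = inp - out  # relative value: store the offset, not the absolute index
        if runs and runs[-1][2] == delta and runs[-1][1] + 1 == out:
            lo, _, _ = runs[-1]
            runs[-1] = (lo, out, delta)  # extend the current run by one cell
        else:
            runs.append((out, out, delta))  # start a new run
    return runs


def backward(runs: List[Run], out: int) -> List[int]:
    """Input cells that contribute to output cell `out`, read from the runs."""
    return sorted(out + delta for lo, hi, delta in runs if lo <= out <= hi)


def forward(runs: List[Run], inp: int) -> List[int]:
    """Output cells that depend on input cell `inp`, read from the runs."""
    return sorted(inp - delta for lo, hi, delta in runs if lo <= inp - delta <= hi)


# Lineage of a 3-wide sliding window: output i reads inputs i, i+1, i+2.
pairs = [(i, i + k) for i in range(1000) for k in range(3)]
runs = compress(pairs)
print(len(pairs), "pairs ->", len(runs), "runs")
print(backward(runs, 5), forward(runs, 5))
```

In this sketch the 3000 cell-level pairs of a sliding window collapse to three runs, one per offset, and neither query ever materializes the original pairs. Irregular operations (for example, a random permutation) would produce roughly one run per pair under this scheme, which is consistent with the abstract's qualification that the gains apply to functions with simple spatial regularity.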
