Compression and In-Situ Query Processing for Fine-Grained Array Lineage

Jinjin Zhao, Sanjay Krishnan

TL;DR

DSLog addresses the challenge of storing fine-grained array lineage by introducing ProvRC, a compression algorithm that exploits spatial regularity via Multi-Attribute Range Encoding and Relative Value Transformation. It supports in-situ query processing directly over compressed lineage for forward and backward queries, avoiding decompression and delivering large performance gains. The approach uses a normalized relational model with index reshaping to enable cross-operation reuse, achieving substantial storage reductions (down to around $0.3\%$) and dramatic query-latency improvements (up to $20\times$ in some pipelines) on array workloads up to $10^6$ cells. Together, capture, compression, reuse, and in-situ querying enable practical, scalable fine-grained provenance for data science workflows, including integration with numpy and real-world data pipelines.

Abstract

Tracking data lineage is important for data integrity, reproducibility, and debugging data science workflows. However, fine-grained lineage (i.e., at a cell level) is challenging to store, even for the smallest datasets. This paper introduces DSLog, a storage system that efficiently stores, indexes, and queries array data lineage, agnostic to capture methodology. A main contribution is our new compression algorithm, named ProvRC, that compresses captured lineage relationships. Using ProvRC for lineage compression results in a significant storage reduction over functions with simple spatial regularity, beating alternative columnar-store baselines by up to 2000x. We also show that ProvRC facilitates in-situ query processing that allows forward and backward lineage queries without decompression; in the optimal case, it surpasses baselines by 20x in query latency on random numpy pipelines.

Paper Structure (34 sections, 13 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: We present a typical operation that sums over the second axis of an array (A), the relational representation of such operation (B), and an array visualization of the lineage relation described by the first row of the relational representation (C).
  • Figure 2: This figure shows an example of multi-attribute range encoding, where the all-to-all relationship between four cells in one array and one cell (A) in another array can be represented succinctly with ranges (B).
  • Figure 3: This is an example of relative transformation and output range encoding, where, given the one-to-one relationship between two 2x1 arrays (A), we can perform a relative indices transformation on the first array (B), introducing new range compression opportunities for efficient encoding (C).
  • Figure 4: This is an example of how a range join preserves lineage: given a compressed lineage table (A), we can identify the intersections between the query and the table (B) and then get the full lineage of those intersecting intervals (C).
  • Figure 5: This is an example of finding absolute indices from relative range intervals: given a compressed lineage table (A), we can perform a range join with a query (B) and then calculate the absolute indices from the resulting table, breaking the all-to-all relationship (C).
  • ...and 4 more figures
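
The ideas in Figures 1, 2, and 4 can be illustrated with a small sketch. This is not the paper's ProvRC implementation: it range-encodes only a single attribute, whereas the paper describes multi-attribute range encoding, and every function name here (`sum_axis1_lineage`, `range_encode_last`, `backward_query`) is hypothetical. It assumes the lineage of a row-wise sum is captured as `(out_i, in_i, in_j)` tuples, collapses consecutive `in_j` values into closed ranges, and answers a backward query by filtering the encoded rows without expanding them.

```python
from itertools import groupby


def sum_axis1_lineage(rows, cols):
    """Cell-level lineage of out[i] = sum(a[i, :]) as (out_i, in_i, in_j) tuples."""
    return [(i, i, j) for i in range(rows) for j in range(cols)]


def range_encode_last(tuples):
    """Collapse runs of consecutive values in the last attribute into closed ranges.

    Each encoded row is the unchanged prefix followed by a (lo, hi) pair.
    Illustrative only: a single-attribute simplification of range encoding.
    """
    encoded = []
    # Sorting makes equal prefixes adjacent and orders the last attribute.
    for prefix, group in groupby(sorted(tuples), key=lambda t: t[:-1]):
        lo = hi = None
        for t in group:
            v = t[-1]
            if lo is None:
                lo = hi = v
            elif v <= hi + 1:
                hi = max(hi, v)  # extend the current run
            else:
                encoded.append(prefix + ((lo, hi),))  # close the run
                lo = hi = v
        encoded.append(prefix + ((lo, hi),))
    return encoded


def backward_query(encoded, out_lo, out_hi):
    """Return (in_i, (in_j_lo, in_j_hi)) for outputs in [out_lo, out_hi].

    Works directly on the encoded rows; no range is expanded into cells.
    """
    return [(in_i, rng) for out_i, in_i, rng in encoded if out_lo <= out_i <= out_hi]


raw = sum_axis1_lineage(3, 4)          # one tuple per (output cell, input cell) pair
compressed = range_encode_last(raw)    # one row per output cell
print(len(raw), len(compressed))
print(backward_query(compressed, 1, 1))
```

For a 3x4 input, the uncompressed relation has one tuple per contributing input cell, while the encoded relation has one row per output cell, so the encoded size no longer grows with the length of the summed axis.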