In-Memory Indexing and Querying of Provenance in Data Preparation Pipelines

Khalid Belhajjame, Haroun Mezrioui, Yuyan Zhao

TL;DR

This paper introduces TensProv, an in-memory, tensor-based system for capturing and querying provenance in data preparation pipelines. It encodes fine-grained provenance at the data-record level and augments it with attribute-level metadata to enable inference of cell-level dependencies, while keeping memory use low through sparse representations and a tree-like provenance structure. The approach supports a broad set of forward and backward provenance queries, including co-contributory and co-dependency analyses, using efficient in-memory operations such as slicing and projection. Empirical evaluations across real and synthetic pipelines demonstrate substantial memory savings, low capture overhead, and millisecond-scale query performance, outperforming disk-based and prior in-memory approaches, particularly for join-heavy workloads. The work shows practical potential for real-time debugging, fairness assessment, and data quality tracing during pipeline development.

Abstract

Data provenance has numerous applications in the context of data preparation pipelines. It can be used for debugging faulty pipelines, interpreting results, verifying fairness, and identifying data quality issues, which may affect the sources feeding the pipeline execution. In this paper, we present an indexing mechanism to efficiently capture and query pipeline provenance. Our solution leverages tensors to capture fine-grained provenance of data processing operations, using minimal memory. In addition to record-level lineage relationships, we provide finer granularity at the attribute level. This is achieved by augmenting tensors, which capture retrospective provenance, with prospective provenance information, drawing connections between input and output schemas of data processing operations. We demonstrate how these two types of provenance (retrospective and prospective) can be combined to answer a broad range of provenance queries efficiently, and show effectiveness through evaluation exercises using both real and synthetic data.
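To make the combination of retrospective and prospective provenance more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the record-level provenance of a single operation can be viewed as a sparse binary relation between output rows and input rows, and that the prospective part is a mapping from each output attribute to the input attributes it is derived from. The class `OpProvenance` and its methods are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


class OpProvenance:
    """Hypothetical sparse provenance index for one data processing operation."""

    def __init__(self, pairs, attr_map):
        # pairs: the non-zero entries (output_row, input_row) of a binary
        # provenance matrix; only these entries are stored (sparse form).
        self.by_output = defaultdict(set)
        self.by_input = defaultdict(set)
        for out_row, in_row in pairs:
            self.by_output[out_row].add(in_row)
            self.by_input[in_row].add(out_row)
        # attr_map: output attribute -> set of input attributes (assumed
        # stand-in for the prospective, schema-level provenance).
        self.attr_map = attr_map

    def backward(self, out_row):
        """Input rows that contributed to a given output row."""
        return set(self.by_output.get(out_row, ()))

    def forward(self, in_row):
        """Output rows that a given input row contributed to."""
        return set(self.by_input.get(in_row, ()))

    def backward_cell(self, out_row, out_attr):
        """Input cells (row, attribute) that an output cell may depend on."""
        in_attrs = self.attr_map.get(out_attr, set())
        return {(r, a) for r in self.backward(out_row) for a in in_attrs}


# A filter that keeps input rows 0 and 2 of a three-row table.
filter_prov = OpProvenance(
    pairs=[(0, 0), (1, 2)],
    attr_map={"name": {"name"}, "age": {"age"}},
)

print(filter_prov.backward(1))            # rows feeding output row 1
print(filter_prov.forward(1))             # input row 1 was filtered out
print(filter_prov.backward_cell(1, "age"))
```

In this sketch, the record-level lookups play the role of slicing the provenance relation along one dimension, and the cell-level query is inferred by combining the result with the attribute mapping rather than storing cell-level links explicitly.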

Paper Structure

This paper contains 44 sections, 11 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Tensor representation.
  • Figure 2: Elements of the solution put together.
  • Figure 3: Overhead time due to provenance capture in the three use cases.
  • Figure 4: The processing time for each provenance query shown in Table \ref{tab:provenance_queries}.
  • Figure 5: The processing time for queries that necessitate recomputation.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Definition 1: Provenance of a Data Processing Operation