Towards dimensions and granularity in a unified workflow and data provenance framework

Tanja Auge, Sascha Genehr, Meike Klettke, Frank Krüger, Max Schröder

TL;DR

This paper tackles the need for full traceability by unifying workflow provenance and data provenance and by extending the W7 provenance questions to W7+1. It presents a conceptual framework that encodes workflow provenance as graphs (PROV-O) and data provenance at the file or tuple level, with dimensions including retrospective, prospective, and evolution, and with fine- to coarse-grained granularity driven by the seven provenance questions. The biomedical use case illustrates how in-vitro measurements and in-silico simulations can be linked via common provenance representations, including examples such as provenance polynomials $r_1 \cdot s_1 + r_1 \cdot s_3$ and their witness bases. The work serves as a stepping stone toward a formal specification of a unified provenance framework to improve credibility and reproducibility across scientific domains.
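To make the provenance polynomial mentioned above concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes two hypothetical relations `R` and `S` whose tuple identifiers (`r1`, `s1`, ...) act as provenance tokens, and a hypothetical helper `join_project` that joins on one attribute and projects onto it. Joint use of tuples is read as a product, alternative derivations of the same output as a sum, and the witness basis is taken here simply as the set of token sets, one per monomial.

```python
from collections import defaultdict

# Hypothetical input relations: tuple id -> row. The ids serve as provenance tokens.
R = {"r1": {"a": 1}, "r2": {"a": 5}}
S = {
    "s1": {"a": 1, "b": "x"},
    "s2": {"a": 2, "b": "x"},
    "s3": {"a": 1, "b": "y"},
}


def join_project(left, right, key):
    """Join on `key`, project onto `key`, and record one monomial per derivation."""
    polynomials = defaultdict(list)
    for left_id, left_row in left.items():
        for right_id, right_row in right.items():
            if left_row[key] == right_row[key]:
                # Both tuples are needed together: product of their tokens.
                polynomials[left_row[key]].append((left_id, right_id))
    # Several monomials for one output value: alternative derivations (sum).
    return dict(polynomials)


def show(monomials):
    """Render a list of monomials as a readable polynomial string."""
    return " + ".join(" * ".join(monomial) for monomial in monomials)


def witness_basis(monomials):
    """Collect the set of source tuples behind each derivation."""
    return {frozenset(monomial) for monomial in monomials}


result = join_project(R, S, "a")
polynomial_text = show(result[1])
witnesses = witness_basis(result[1])
print(polynomial_text)
```

For the output value `a = 1`, the sketch records two derivations, matching the polynomial $r_1 \cdot s_1 + r_1 \cdot s_3$: the result tuple can be explained either by `r1` together with `s1` or by `r1` together with `s3`.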

Abstract

Provenance information is essential for the traceability of scientific studies or experiments and is thus crucial for ensuring the credibility and reproducibility of research findings. This paper discusses a comprehensive provenance framework that combines two types of provenance, (1) workflow provenance and (2) data provenance, together with their dimensions and granularity, and thereby enables answering the W7+1 provenance questions. We demonstrate its applicability with a biomedical research use case that can easily be transferred to other scientific fields. Integrating these concepts into a unified framework supports the credibility and reproducibility of research findings.

Paper Structure

This paper contains 8 sections, 3 figures.

Figures (3)

  • Figure 1: Biomedical use case with two experiment types (in-vitro and in-silico) whose research findings are used to validate and optimize the other. Activities are numbered to illustrate the trajectory.
  • Figure 2: Biomedical in-vitro experiment with organizations and persons (agents), cell cultures and samples as well as data sets (entities), and research activities (activity) including their relationships in the PROV standard. Note that PROV relationship direction is typically from the result to the origin.
  • Figure 3: W7+1 provenance questions for workflow provenance ($-\!\Box$) and data provenance ($\multimap$), including typical (solid lines) and newly defined (dotted lines) answer options.
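
The PROV structure described for Figure 2 can be sketched as a small edge list. This is an illustrative sketch only: the node names (`dataset`, `measurement`, `researcher`, and so on) are hypothetical and not taken from the paper, while the relationship names are standard PROV-O properties. Each edge points from the result to its origin, so following edges from a data set walks back through the activities, entities, and agents that contributed to it.

```python
# Hypothetical nodes; each edge reads (subject, PROV-O property, object)
# and points from the result toward its origin.
EDGES = [
    ("dataset", "wasGeneratedBy", "measurement"),       # entity   -> activity
    ("measurement", "used", "sample"),                  # activity -> entity
    ("sample", "wasGeneratedBy", "sampling"),           # entity   -> activity
    ("sampling", "used", "cell_culture"),               # activity -> entity
    ("measurement", "wasAssociatedWith", "researcher"), # activity -> agent
    ("researcher", "actedOnBehalfOf", "institute"),     # agent    -> agent
]


def trace_back(node, edges):
    """Return every node reachable from `node` by following edges toward the origin."""
    seen = []
    stack = [node]
    while stack:
        current = stack.pop()
        for subject, _relation, obj in edges:
            if subject == current and obj not in seen:
                seen.append(obj)
                stack.append(obj)
    return seen


lineage = trace_back("dataset", EDGES)
print(lineage)
```

Because the edges already run from result to origin, answering "where did this data set come from?" needs only a forward traversal of the graph; no edge has to be reversed.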