Towards dimensions and granularity in a unified workflow and data provenance framework
Tanja Auge, Sascha Genehr, Meike Klettke, Frank Krüger, and Max Schröder
TL;DR
This paper tackles the need for full traceability by unifying workflow provenance and data provenance and by extending the W7 provenance questions to W7+1. It presents a conceptual framework that encodes workflow provenance as graphs (PROV-O) and data provenance at the file or tuple level, with dimensions including retrospective, prospective, and evolution, and with fine- to coarse-grained granularity driven by the seven provenance questions. The biomedical use case illustrates how in-vitro measurements and in-silico simulations can be linked via common provenance representations, including examples such as provenance polynomials $r_1 \cdot s_1 + r_1 \cdot s_3$ and their witness bases. The work serves as a stepping stone toward a formal specification of a unified provenance framework to improve credibility and reproducibility across scientific domains.
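To make the provenance polynomial mentioned above concrete, the following is a minimal sketch of how $r_1 \cdot s_1 + r_1 \cdot s_3$ and its witness basis could arise from a join followed by a projection. The relations `R` and `S`, their contents, and all function names are hypothetical and chosen only for illustration; they are not taken from the paper's use case.

```python
from itertools import product

# A polynomial is a list of monomials; a monomial is a sorted tuple of
# source-tuple identifiers (assumed representation, for illustration only).

def times(p, q):
    """Joint use of source tuples (join): multiply every pair of monomials."""
    return [tuple(sorted(m + n)) for m, n in product(p, q)]

def plus(p, q):
    """Alternative derivations (union/projection): concatenate monomials."""
    return p + q

def witness_basis(p):
    """Minimal sets of source tuples that each suffice to derive the result."""
    witnesses = {frozenset(m) for m in p}
    return {w for w in witnesses if not any(v < w for v in witnesses)}

def render(p):
    return " + ".join("*".join(m) for m in p)

# Hypothetical input relations, keyed by tuple identifier.
R = {"r1": ("a", 1), "r2": ("b", 2)}
S = {"s1": (1, "x"), "s2": (2, "y"), "s3": (1, "z")}

def query_provenance(R, S, value):
    """Provenance of `value` in the projection on R's first attribute of
    the join of R and S (R's second attribute = S's first attribute)."""
    poly = []
    for rid, r in R.items():
        for sid, s in S.items():
            if r[1] == s[0] and r[0] == value:
                poly = plus(poly, times([(rid,)], [(sid,)]))
    return poly

poly = query_provenance(R, S, "a")
print(render(poly))         # r1*s1 + r1*s3
print(witness_basis(poly))  # the two witnesses {r1, s1} and {r1, s3}
```

In this sketch, multiplication records that two source tuples were used together, and addition records alternative derivations of the same result tuple. The witness basis discards the algebraic structure and keeps only the minimal sets of source tuples that justify the result.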
Abstract
Provenance information is essential for the traceability of scientific studies and experiments and is thus crucial for ensuring the credibility and reproducibility of research findings. This paper discusses a comprehensive provenance framework that combines two types of provenance, (1) workflow provenance and (2) data provenance, together with their dimensions and granularity, and that makes it possible to answer the W7+1 provenance questions. We demonstrate its applicability with a biomedical research use case that can be easily transferred to other scientific fields. Integrating these concepts into a unified framework supports the credibility and reproducibility of research findings.
