Recording provenance of workflow runs with RO-Crate

Simone Leo, Michael R. Crusoe, Laura Rodríguez-Navas, Raül Sirvent, Alexander Kanitz, Paul De Geest, Rudolf Wittner, Luca Pireddu, Daniel Garijo, José M. Fernández, Iacopo Colonnelli, Matej Gallo, Tazro Ohta, Hirotaka Suetake, Salvador Capella-Gutierrez, Renske de Wit, Bruno P. Kinoshita, Stian Soiland-Reyes

TL;DR

This work presents Workflow Run RO-Crate, an extension of RO-Crate and Schema.org to capture the provenance of the execution of computational workflows at different levels of granularity and bundle together all their associated objects.

Abstract

Recording the provenance of scientific computation results is key to the support of traceability, reproducibility and quality assessment of data products. Several data models have been explored to address this need, providing representations of workflow plans and their executions as well as means of packaging the resulting information for archiving and sharing. However, existing approaches tend to lack interoperable adoption across workflow management systems. In this work we present Workflow Run RO-Crate, an extension of RO-Crate (Research Object Crate) and Schema.org to capture the provenance of the execution of computational workflows at different levels of granularity and bundle together all their associated objects (inputs, outputs, code, etc.). The model is supported by a diverse, open community that runs regular meetings, discussing development, maintenance and adoption aspects. Workflow Run RO-Crate is already implemented by several workflow management systems, allowing interoperable comparisons between workflow runs from heterogeneous systems. We describe the model, its alignment to standards such as W3C PROV, and its implementation in six workflow systems. Finally, we illustrate the application of Workflow Run RO-Crate in two use cases of machine learning in the digital image analysis domain. A corresponding RO-Crate for this article is at https://w3id.org/ro/doi/10.5281/zenodo.10368989

Paper Structure (23 sections, 5 figures, 3 tables)


Figures (5)

  • Figure 1: UML class diagram for Process Run Crate. The central class is the https://schema.org/CreateAction, which represents the execution of an application. It links to the application itself via https://schema.org/instrument, to the entity that executed it via https://schema.org/agent, and to its inputs and outputs via https://schema.org/object and https://schema.org/result, respectively. In this and following figures, classes and properties are shown with prefixes to indicate their origin. Some inputs (and, less commonly, outputs) are not stored as files or directories, but passed to the application (e.g., via a command line interface) as values of various types (e.g., a number or string). In this case, the profile recommends a representation via https://schema.org/PropertyValue. For simplicity, we left out the rest of the RO-Crate structure (e.g. the root https://schema.org/Dataset), and attributes (e.g. https://schema.org/startTime, https://schema.org/endTime, https://schema.org/description, https://schema.org/actionStatus). In this UML class notation, diamond $\Diamond$ arrows indicate aggregation and regular arrows indicate references, $*$ indicates zero or more occurrences, $1$ means single occurrence.
  • Figure 2: Diagram of a simple workflow where the head and sort programs were run manually by a user. The executions of the individual software programs are connected by the fact that the file output by head was used as input for sort, documenting the computational flow in an implicit way. Such executions can be represented with Process Run Crate.
  • Figure 3: UML class diagram for Workflow Run Crate. The main differences with Process Run Crate are the representation of formal parameters and the fact that the workflow is expected to be an entity with types https://schema.org/MediaObject (File in RO-Crate JSON-LD), https://schema.org/SoftwareSourceCode and https://bioschemas.org/ComputationalWorkflow. Effectively, the workflow belongs to all three types, and its properties are the union of the properties of the individual types. In this profile, the execution history (retrospective provenance) is augmented by a (prospective) workflow definition, giving a high-level overview of the workflow and its input and output parameter definitions (https://bioschemas.org/FormalParameter). The inner structure of the workflow is not represented in this profile. In the provenance part, individual files (https://schema.org/MediaObject) or arguments (https://schema.org/PropertyValue) are then connected to the parameters they realise. Most workflow systems can consume and produce multiple files, and this mechanism helps to declare each file's role in the workflow execution. The filled diamond $\blacklozenge$ indicates composition, empty diamond $\Diamond$ aggregation, and other arrows relations.
  • Figure 4: UML class diagram for Provenance Run Crate. In addition to the workflow run, this profile represents the execution of individual steps and their related tools. The prospective side (the execution plan) is shown by the workflow listing a series of https://schema.org/HowToSteps, each linking to the https://schema.org/SoftwareApplication that is to be executed. The https://bioschemas.org/properties/input and https://bioschemas.org/properties/output parameters for each tool are described in a similar way to the overall workflow parameters in Figure 3. The retrospective provenance side of this profile includes each tool execution as an additional https://schema.org/CreateAction with a similar mapping to the realised parameters as https://schema.org/MediaObject or https://schema.org/PropertyValue, allowing intermediate values to be included in the RO-Crate even if they are not workflow outputs. The workflow execution is described in the same way as in the Workflow Run Crate profile, with an overall https://schema.org/CreateAction (the workflow outputs will typically also appear as outputs from inner tool executions). An additional https://schema.org/OrganizeAction represents the workflow engine execution, which orchestrated the steps from the workflow plan through corresponding https://schema.org/ControlActions that spawned each tool's execution (https://schema.org/CreateAction). It is possible that a single workflow step had multiple such executions (e.g. array iterations). Not shown in the figure: https://schema.org/actionStatus and https://schema.org/error to indicate step/workflow execution status. The filled diamond $\blacklozenge$ indicates composition, the empty diamond $\Diamond$ aggregation, and other arrows relations.
  • Figure 5: Venn diagram of the specifications for the various RO-Crate profiles. Process Run Crate specifies how to describe the fundamental classes involved in a computational run, and thus is the basis for all profiles in the WRROC collection. Workflow Run Crate inherits the specifications of both Process Run Crate and Workflow RO-Crate. Provenance Run Crate, in turn, inherits the specifications of Workflow Run Crate (and in a sense includes multiple Process Runs for each step execution, but within a single Crate).
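
The structure described in the captions of Figures 1 and 2 can be illustrated with a small sketch. The Python snippet below builds a JSON-LD `@graph` fragment for the head/sort example: each program run is a CreateAction that points to its application via instrument, to the person who ran it via agent, and to its inputs and outputs via object and result. Like Figure 1, the fragment leaves out the root Dataset and the rest of the RO-Crate structure. All identifiers and file names (`#run-head`, `lines.txt`, `Example User`, the `lines` argument and so on) are hypothetical and chosen only for illustration; this is not output from any of the workflow systems discussed in the paper.

```python
import json


def file_entity(path):
    """A file data entity (schema.org MediaObject, written as File in RO-Crate JSON-LD)."""
    return {"@id": path, "@type": "File"}


def create_action(action_id, tool_id, agent_id, inputs, outputs):
    """One execution of an application, with references to its inputs and outputs."""
    return {
        "@id": action_id,
        "@type": "CreateAction",
        "instrument": {"@id": tool_id},               # the application that was run
        "agent": {"@id": agent_id},                   # who ran it
        "object": [{"@id": i} for i in inputs],       # inputs
        "result": [{"@id": o} for o in outputs],      # outputs
    }


# Hypothetical entities for the head -> sort example.
graph = [
    {"@id": "#head", "@type": "SoftwareApplication", "name": "head"},
    {"@id": "#sort", "@type": "SoftwareApplication", "name": "sort"},
    {"@id": "#user", "@type": "Person", "name": "Example User"},
    # An input passed as a value rather than as a file.
    {"@id": "#lines", "@type": "PropertyValue", "name": "lines", "value": 3},
    file_entity("lines.txt"),
    file_entity("head_out.txt"),
    file_entity("sorted.txt"),
    create_action("#run-head", "#head", "#user", ["lines.txt", "#lines"], ["head_out.txt"]),
    create_action("#run-sort", "#sort", "#user", ["head_out.txt"], ["sorted.txt"]),
]


def implicit_links(entities):
    """Find (producer, consumer, entity) triples where one action's result is another's object."""
    actions = [e for e in entities if e["@type"] == "CreateAction"]
    links = []
    for producer in actions:
        produced = {r["@id"] for r in producer["result"]}
        for consumer in actions:
            if consumer is producer:
                continue
            for obj in consumer["object"]:
                if obj["@id"] in produced:
                    links.append((producer["@id"], consumer["@id"], obj["@id"]))
    return links


if __name__ == "__main__":
    print(json.dumps({"@graph": graph}, indent=2))
    print(implicit_links(graph))
```

The `implicit_links` helper (also hypothetical) recovers the computational flow that Figure 2 describes as implicit: the two runs are connected only because the file produced by head appears among the inputs of sort.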