
Formal Definition and Implementation of Reproducibility Tenets for Computational Workflows

Nicholas J. Pritchard, Andreas Wicenec

TL;DR

This paper tackles the reproducibility crisis in computational science by proposing a scale- and system-agnostic UML model for scientific workflows, together with seven reproducibility tenets and a blockchain-inspired BlockDAG signature mechanism. The approach enables amortized constant-time construction and verification of workflow signatures and is implemented in the DALiuGE system, where it embeds provenance information for formal verification of scientific quality. Using a demonstrative lowpass-filter workflow, the authors show how the notions of Rerun, Repeat, Recompute, Reproduce, and Replicate can reveal subtle discrepancies and establish equivalence across executions that vary in hardware and software. The work provides a concrete pathway toward cross-system reproducibility testing for large-scale, data-intensive astronomy projects like the SKA and suggests extending provenance and verification to additional workflow systems. Overall, the framework combines formal UML modeling, principled provenance collection, and a practical signature mechanism to improve the reliability and interpretability of complex scientific workflows.

Abstract

Computational workflow management systems power contemporary data-intensive sciences. The slowly resolving reproducibility crisis presents both a sobering warning and an opportunity to iterate on what science and data processing entail. The Square Kilometre Array (SKA), the world's largest radio telescope, is among the most extensive scientific projects underway and presents grand scientific collaboration and data-processing challenges. In this work, we aim to improve the ability of workflow management systems to facilitate reproducible, high-quality science. This work presents a scale- and system-agnostic computational workflow model and extends five well-known reproducibility concepts into seven well-defined tenets for this workflow model. Additionally, we present a method to construct workflow execution signatures using cryptographic primitives in amortized constant time. We combine these three concepts and provide a concrete implementation in the Data Activated Flow Graph Engine (DALiuGE), a workflow management system for the SKA, to embed specific provenance information into workflow signatures, demonstrating that automatic formal verification of scientific quality is possible in amortized constant time. We validate our approach with a simple yet representative astronomical processing task: filtering a noisy signal with a lowpass filter using CPU and GPU methods. This example shows the practicality and efficacy of combining formal tenet definitions with a workflow signature generation mechanism. Our framework, spanning formal UML specification, principled provenance information collection based on reproducibility tenets, and a concrete example implementation in DALiuGE, illuminates otherwise obscure scientific discrepancies and similarities between workflow executions that are identical in principle.
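The signature idea in the abstract can be illustrated with a small sketch. This is not the DALiuGE implementation: the choice of SHA-256, the function names, and the provenance field names (`kind`, `algorithm`, `device`, `storage`) are all assumptions made for illustration, and the paper's actual mapping from tenets to provenance fields is not reproduced here. The sketch only shows the general BlockDAG-style pattern: each component's hash covers its selected provenance fields plus the hashes of its upstream components, so every component is hashed once and the final signature is derived from the leaves.

```python
import hashlib
import json
from graphlib import TopologicalSorter  # Python 3.9+


def component_hash(fields, parent_hashes):
    """Hash selected provenance fields together with the parents' hashes."""
    h = hashlib.sha256(json.dumps(fields, sort_keys=True).encode())
    for parent in sorted(parent_hashes):  # sorted so parent order is irrelevant
        h.update(bytes.fromhex(parent))
    return h.hexdigest()


def workflow_signature(components, edges, selected_fields):
    """Compute one signature for a workflow DAG.

    components: {name: {field: value}}, hypothetical provenance per component
    edges: list of (upstream, downstream) pairs
    selected_fields: provenance fields that count for this comparison
    """
    parents = {name: set() for name in components}
    for up, down in edges:
        parents[down].add(up)
    has_child = {up for up, _ in edges}

    hashes = {}
    # Visit upstream components first, so each hash is computed exactly once.
    for name in TopologicalSorter(parents).static_order():
        fields = {k: v for k, v in components[name].items() if k in selected_fields}
        hashes[name] = component_hash(fields, [hashes[p] for p in parents[name]])

    # The workflow signature combines the hashes of all leaf components.
    leaves = [hashes[n] for n in components if n not in has_child]
    return component_hash({}, leaves)


# Hypothetical lowpass-filter workflow executed on two devices.
cpu = {
    "signal": {"kind": "data", "storage": "memory"},
    "filter": {"kind": "task", "algorithm": "lowpass", "device": "cpu"},
    "filtered": {"kind": "data", "storage": "file"},
}
gpu = {**cpu, "filter": {**cpu["filter"], "device": "gpu"}}
edges = [("signal", "filter"), ("filter", "filtered")]

strict_cpu = workflow_signature(cpu, edges, {"kind", "algorithm", "device"})
strict_gpu = workflow_signature(gpu, edges, {"kind", "algorithm", "device"})
loose_cpu = workflow_signature(cpu, edges, {"kind", "algorithm"})
loose_gpu = workflow_signature(gpu, edges, {"kind", "algorithm"})
```

Under these assumptions, the two runs produce different signatures when the device field is included and identical signatures when it is left out. Choosing which fields enter the hash is what lets a single mechanism serve several notions of equivalence.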

Paper Structure (33 sections, 1 equation, 18 figures, 11 tables)

This paper contains 33 sections, 1 equation, 18 figures, 11 tables.

Figures (18)

  • Figure 1: A UML diagram depicting an arbitrary-scale workflow definition. Scientific information expresses the information used in making a scientific claim and is derived by running a workflow. A workflow is a collection of components with an imposed ordered structure. A logical workflow is a workflow composed of logical components. A physical workflow is a workflow composed of physical components. A component is an atomic digital entity; it could be a (logical or physical) task or data artifact. A logical task describes a task characterized by high-level information like programming language, algorithm, and required options and parameters. Logical task fields capture additional detail about a logical task, such as the exact software script, package, or command executed. A physical task is a single, executable instance of a logical task. A logical task could be realized by one or more physical tasks depending on the degree of parallelism of the workflow. Physical task fields capture additional detail about a physical task's execution, such as the details of the machine a task is eventually executed on. A logical data artifact is a logical data resource characterized by storage type and required configurations. A physical data artifact is the actual datastore, like a file, distributed file system, or database. A single logical data artifact may encompass many individual physical data artifacts at runtime, depending on the degree of parallelism of the workflow. The ordered structure requires tasks and data artifacts to appear in alternating order; a component cannot directly precede another component of the same kind.
  • Figure 2: A UML workflow model with required identical components for workflow Reruns highlighted in blue.
  • Figure 3: A UML workflow model with required identical components for workflow Repetitions highlighted in blue.
  • Figure 4: A UML workflow model with required identical components for workflow Recomputations highlighted in blue.
  • Figure 5: A UML workflow model with required identical components for workflow Reproductions highlighted in blue.
  • ...and 13 more figures
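
The workflow model described in the Figure 1 caption can be approximated in code. The sketch below is a simplification under stated assumptions: the class names, the `fields` dictionaries, the `realize` helper, and the `machine` field are hypothetical, and the workflow is modeled as a linear chain rather than a general graph. It shows two ideas from the caption: tasks and data artifacts must alternate, and one logical task may be realized by several physical tasks depending on the degree of parallelism.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """Atomic digital entity: either a task or a data artifact."""
    name: str
    fields: dict = field(default_factory=dict)  # logical or physical detail


@dataclass
class Task(Component):
    pass


@dataclass
class DataArtifact(Component):
    pass


@dataclass
class Workflow:
    """Ordered collection of components (simplified here to a linear chain)."""
    components: List[Component]
    level: str = "logical"  # "logical" or "physical"

    def alternates(self) -> bool:
        """True if no component directly precedes another of the same kind."""
        pairs = zip(self.components, self.components[1:])
        return all(isinstance(a, Task) != isinstance(b, Task) for a, b in pairs)


def realize(task: Task, parallelism: int, machine: str) -> List[Task]:
    """Expand one logical task into physical task instances (hypothetical helper)."""
    return [
        Task(f"{task.name}[{i}]", {**task.fields, "machine": machine})
        for i in range(parallelism)
    ]


lowpass = Task("filter", {"algorithm": "lowpass", "language": "python"})
valid = Workflow([DataArtifact("signal"), lowpass, DataArtifact("filtered")])
invalid = Workflow([DataArtifact("signal"), lowpass, Task("plot")])
physical_tasks = realize(lowpass, parallelism=3, machine="node-a")
```

Here `valid.alternates()` holds while `invalid.alternates()` does not, because two tasks appear back to back. A fuller model would store explicit edges between components instead of relying on list order.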