A Fault Tolerance Mechanism for Hybrid Scientific Workflows

Alberto Mulone, Doriana Medić, Marco Aldinucci

TL;DR

The paper addresses fault tolerance for hybrid scientific workflows running across heterogeneous distributed environments where data locality is crucial. It introduces a recovery-workflow mechanism that creates separate workflows to recover failed steps, guided by a provenance graph and a BFS-based rollback strategy, and implements this approach within StreamFlow, a CWL-based WMS. A formal syntactic representation and an example demonstration accompany the implementation, including support for loops and multi-instance steps, validated through Kubernetes-based experiments under simulated faults. The work advances reliability in WMSs for federated and heterogeneous computing, while acknowledging overhead from metadata collection and cross-workflow synchronization and outlining future directions such as overhead evaluation, real-world workloads, checkpointing, and extending the semantics to nondeterministic scenarios.

Abstract

In large distributed systems, failures occur daily, especially as the number of computation tasks and of the locations on which they are deployed grows. Representing an application as a workflow makes it possible to exploit Workflow Management System (WMS) features such as portability. Another relevant feature that some WMSs provide is reliability. In recent years, the emergence of hybrid workflows has posed new challenges by making it possible to distribute computations across heterogeneous and independent environments. Consequently, the number of possible points of failure in an execution has increased. This paper presents the implementation of a fault tolerance mechanism for hybrid workflows based on the recovery and rollback approach. A representation of hybrid workflows in a formal framework is provided, together with experiments demonstrating the functionality of the implemented approach.
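
The BFS-based rollback over a provenance graph mentioned in the TL;DR can be illustrated with a minimal sketch. This is not StreamFlow's API or the authors' algorithm: the function `steps_to_rerun`, the dictionary layout (`inputs_of`, `producer_of`), and the `available` set are all hypothetical names chosen for illustration. The assumption is that rollback walks backwards from the failed step and stops at any input data that is still retrievable from a live location.

```python
from collections import deque


def steps_to_rerun(failed_step, inputs_of, producer_of, available):
    """Return the set of steps a recovery workflow would need to re-execute.

    All names here are hypothetical, not taken from StreamFlow:
      inputs_of:   step name -> list of input token names
      producer_of: token name -> step that produced it (absent for initial inputs)
      available:   set of tokens still retrievable from a live location
    """
    to_rerun = {failed_step}
    queue = deque([failed_step])
    while queue:
        step = queue.popleft()
        for token in inputs_of.get(step, []):
            if token in available:
                continue  # data still reachable: rollback stops on this branch
            producer = producer_of.get(token)
            if producer is None:
                # An initial workflow input was lost; nothing can regenerate it.
                raise ValueError(f"initial input {token!r} is lost")
            if producer not in to_rerun:
                to_rerun.add(producer)
                queue.append(producer)
    return to_rerun


# Toy three-step chain: in -> S1 -> t1 -> S2 -> t2 -> S3
inputs_of = {"S1": ["in"], "S2": ["t1"], "S3": ["t2"]}
producer_of = {"t1": "S1", "t2": "S2"}

# S3 failed and its input t2 was lost, but t1 survived: only S2 and S3 are re-run.
recovery_steps = steps_to_rerun("S3", inputs_of, producer_of, {"in", "t1"})
```

The breadth-first order means the steps closest to the failure are examined first, and the `available` check is what lets data locality limit how far the rollback goes.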

Paper Structure (14 sections, 16 equations, 5 figures)

Figures (5)

  • Figure 1: The workflow model presents a loop of three steps in which step S2 (i.e., SumRow) has multiple instances. The steps are mapped onto locations A, B, C and D.
  • Figure 2: Execution of the workflow in Fig. 1. Some step names are omitted for the sake of readability. Execution time is shown in minutes and seconds.
  • Figure 3: Left: the visited provenance graph. Right: the created recovery workflow.
  • Figure 4: Reduction semantics rules.
  • Figure 5: Recovery semantics rules.
