StreamFlow: cross-breeding cloud with HPC

Iacopo Colonnelli, Barbara Cantalupo, Ivan Merelli, Marco Aldinucci

TL;DR

The paper addresses the portability gap of scientific workflows across heterogeneous environments (HPC and cloud) by introducing StreamFlow, a declarative cross-site execution layer that attaches environment descriptions to a workflow graph and enables multi-site execution without a common data space. StreamFlow integrates with existing coordination languages (notably CWL) through a Connector-based framework. It partitions workflows into atomic multi-container deployment units and couples them with data-transfer-aware scheduling and data management. The key contributions include the DeploymentManager, the DataManager, and a StreamFlow file (streamflow.yml) that binds tasks to environment models. The approach is demonstrated on a CWL-described single-cell RNA-seq pipeline executed across Kubernetes and an on-prem HPC cluster, with a hybrid HPC/cloud configuration showing comparable performance. The work highlights practical benefits for resource utilization and reproducibility in data-intensive bioscience pipelines, and points to future enhancements in language support, additional connectors, and richer inter-container communication abstractions.

Abstract

Workflows are among the most commonly used tools in a variety of execution environments. Many of them target a specific environment; few of them make it possible to execute an entire workflow in different environments, e.g. Kubernetes and batch clusters. We present a novel approach to workflow execution, called StreamFlow, that complements the workflow graph with a declarative description of potentially complex execution environments, and that makes execution possible across multiple sites that do not share a common data space. StreamFlow is then exemplified on a novel bioinformatics pipeline for single-cell transcriptomic data analysis.

Paper Structure

This paper contains 19 sections, 10 figures.

Figures (10)

  • Figure 1: StreamFlow framework's logical stack. Coloured portions refer to existing technologies, while white ones are directly part of StreamFlow codebase. In particular, the orange area is related to the definition of the workflow's dependency graph, while the green area refers to the execution environments.
  • Figure 2: Workflow graph transformation to include model deployment and undeployment tasks. Orange nodes represent original tasks, while the others refer to model deployment (downward pointing arrow) and undeployment (upward pointing arrow) phases.
  • Figure 3: UML class diagram for the DeploymentManager class.
  • Figure 4: UML class diagram for the Connector interface.
  • Figure 5: UML class diagram for the Policy interface.
  • ...and 5 more figures
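
The graph transformation described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not StreamFlow's implementation: the function name `add_deployment_tasks`, the `deploy:`/`undeploy:` node labels, and the task and model names are all hypothetical, and the sketch assumes the simplest possible rule, namely that every task bound to a model depends on that model's deployment node, and the undeployment node depends on every task bound to the model.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def add_deployment_tasks(dependencies, bindings):
    """Return a copy of the graph with one deploy/undeploy node per model.

    dependencies: dict mapping each task to the set of tasks it depends on
    bindings:     dict mapping each task to the name of its environment model
    (Both structures are illustrative, not StreamFlow's own data model.)
    """
    graph = {task: set(deps) for task, deps in dependencies.items()}

    # Group tasks by the model they are bound to.
    tasks_by_model = defaultdict(list)
    for task, model in bindings.items():
        tasks_by_model[model].append(task)

    for model, tasks in tasks_by_model.items():
        deploy = f"deploy:{model}"
        undeploy = f"undeploy:{model}"
        graph[deploy] = set()          # deployment has no prerequisites
        graph[undeploy] = set(tasks)   # undeploy only after all bound tasks
        for task in tasks:
            # Each bound task must wait for its model to be deployed.
            graph.setdefault(task, set()).add(deploy)
    return graph


# Hypothetical three-step pipeline split across two environment models.
dependencies = {"align": set(), "count": {"align"}, "cluster": {"count"}}
bindings = {"align": "hpc", "count": "hpc", "cluster": "k8s"}

graph = add_deployment_tasks(dependencies, bindings)
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Any valid execution order of the transformed graph places `deploy:hpc` before `align` and `undeploy:hpc` after `count`, which is the property the caption describes. A real scheduler could refine this, for instance by delaying deployment until a bound task is ready to run.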