StreamFlow: cross-breeding cloud with HPC
Iacopo Colonnelli, Barbara Cantalupo, Ivan Merelli, Marco Aldinucci
TL;DR
The paper addresses the portability gap of scientific workflows across heterogeneous environments (HPC and cloud) by introducing StreamFlow, a declarative cross-site execution layer that attaches environment descriptions to a workflow graph and enables multi-site execution without a common data space. It integrates with existing coordination languages (notably CWL) through a Connector-based framework and partitions workflows into atomic multi-container deployment units, coupled with data-transfer-aware scheduling and data management. The key contributions include the DeploymentManager, DataManager, and a StreamFlow file (streamflow.yml) that binds tasks to environment models, demonstrated on a CWL-described single-cell RNA-seq pipeline executed across Kubernetes and an on-prem HPC cluster, with a hybrid HPC/cloud configuration showing comparable performance. The work highlights practical benefits for resource utilization and reproducibility in data-intensive bioscience pipelines, and points to future enhancements in language support, additional connectors, and richer inter-container communication abstractions.
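The binding idea summarized above can be illustrated with a minimal Python sketch. It is not StreamFlow's API and does not reproduce the streamflow.yml schema: the `Model` and `Step` classes, the `plan_transfers` helper, the connector labels, and the three step names are all hypothetical, chosen only to show how attaching an environment model to each step reveals which workflow edges need an explicit data transfer when sites share no common data space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for an environment model (not StreamFlow's class)."""
    name: str
    connector: str  # illustrative label, e.g. "kubernetes" or "hpc-batch"


@dataclass(frozen=True)
class Step:
    """Hypothetical workflow step; `inputs` names the upstream steps it reads from."""
    name: str
    inputs: tuple


def plan_transfers(steps, bindings):
    """List every workflow edge whose producer and consumer run on different models.

    With no shared data space, each such edge implies an explicit data transfer.
    Returns tuples of (producer, consumer, source_model, destination_model).
    """
    transfers = []
    for step in steps:
        for upstream in step.inputs:
            src = bindings[upstream]
            dst = bindings[step.name]
            if src != dst:
                transfers.append((upstream, step.name, src.name, dst.name))
    return transfers


# Assumed example: two environments and a three-step linear pipeline.
cloud = Model("cloud", "kubernetes")
cluster = Model("cluster", "hpc-batch")

steps = [
    Step("align", ()),
    Step("count", ("align",)),
    Step("cluster_cells", ("count",)),
]

# The step-to-model binding plays the role the summary assigns to streamflow.yml.
bindings = {"align": cluster, "count": cluster, "cluster_cells": cloud}

transfers = plan_transfers(steps, bindings)
for producer, consumer, src, dst in transfers:
    print(f"transfer output of {producer} from {src} to {dst} for {consumer}")
```

In this sketch only the edge from `count` to `cluster_cells` crosses environments, so it is the only one that would require moving data. Steps bound to the same model are assumed to see each other's outputs directly.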
Abstract
Workflows are among the most commonly used tools in a variety of execution environments. Many workflow systems target a specific environment; few make it possible to execute an entire workflow across different environments, e.g. Kubernetes and batch clusters. We present a novel approach to workflow execution, called StreamFlow, that complements the workflow graph with a declarative description of potentially complex execution environments and enables execution on multiple sites that do not share a common data space. StreamFlow is then exemplified on a novel bioinformatics pipeline for single-cell transcriptomic data analysis.
