Towards cloud-native scientific workflow management
Michal Orzechowski, Bartosz Balis, Krzysztof Janecki
TL;DR
The paper investigates how to execute scientific workflows on Kubernetes-enabled cloud-native infrastructures by comparing a simple Job-based model, a Job-based model with task clustering, and a novel Worker Pools model implemented in HyperFlow. Using the Montage workflow on a Kubernetes/OpenStack cluster, it finds that the Worker Pools approach delivers the best cluster utilization and shortest makespan, though it introduces higher implementation and maintenance complexity. The work highlights a fundamental trade-off between simplicity and performance in cloud-native workflow management and provides concrete architectural guidance for choosing between models based on resource and maintenance constraints. It also contributes a practical HyperFlow implementation of the Worker Pools model that demonstrates the feasibility of microservice-based, auto-scalable task execution in scientific workflows.
Abstract
Cloud-native is an approach to building and running scalable applications in modern cloud infrastructures, and the Kubernetes container orchestration platform is often considered a fundamental cloud-native building block. In this paper, we evaluate alternative execution models for scientific workflows in Kubernetes. We compare the simplest job-based model and its variant with task clustering, and we propose a cloud-native model based on microservices comprising auto-scalable worker pools. We implement these models in the HyperFlow workflow management system and evaluate them using a large Montage workflow on a Kubernetes cluster. The results indicate that the proposed cloud-native worker-pools execution model achieves the best performance in terms of average cluster utilization, resulting in a nearly 20% improvement of the workflow makespan compared to the best-performing job-based model. However, the better performance comes at the cost of significantly higher implementation and maintenance complexity. We believe that our experiments provide valuable insight into the performance, advantages, and disadvantages of alternative cloud-native execution models for scientific workflows.
