Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale
Aymen Alsaadi, Shantenu Jha, Matteo Turilli
TL;DR
Hydra addresses the challenge of executing heterogeneous workloads across cloud and HPC platforms by providing a general-purpose brokering layer that can concurrently provision resources on commercial clouds, NSF-sponsored clouds, and HPC systems. It uses a connector-based Python architecture built around a Provider Proxy and a Service Proxy, the latter comprising CaaS, HPC, and Data managers, to map and execute tasks as executables or containers without taking on full workflow management. The paper contributes (1) a design for brokering in the presence of task, platform, resource, and middleware heterogeneity, (2) a reference implementation of that design, Hydra, (3) an experimental characterization of Hydra's overheads and of its strong and weak scaling, and (4) an end-to-end demonstration on the FACTS sea-level workflow that shows cross-platform scalability. The results indicate that Hydra adds minimal overhead relative to the costs imposed by the platforms themselves and scales under both strong- and weak-scaling experiments, enabling large-scale, cross-platform scientific workflows with flexible resource choices.
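As a rough illustration of the brokering idea summarized above, the following minimal Python sketch routes tasks to per-platform managers according to whether they are containers or plain executables. It is not Hydra's API: the names `Task`, `Manager`, and `ServiceProxy.submit`, the provider labels, and the least-loaded placement rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Task:
    """A unit of work: either a plain executable or a container image."""
    name: str
    executable: Optional[str] = None
    image: Optional[str] = None

    @property
    def kind(self) -> str:
        # Assumption: a task that carries an image is treated as a container.
        return "container" if self.image else "executable"


@dataclass
class Manager:
    """Hypothetical per-platform manager that records the tasks routed to it."""
    provider: str   # illustrative label, e.g. "cloud_a" or "hpc_x"
    accepts: str    # task kind this manager handles in this sketch
    submitted: List[str] = field(default_factory=list)

    def submit(self, task: Task) -> None:
        self.submitted.append(task.name)


class ServiceProxy:
    """Hypothetical broker front end: maps each task to a compatible manager."""

    def __init__(self, managers: Iterable[Manager]) -> None:
        self._managers = list(managers)

    def submit(self, tasks: Iterable[Task]) -> Dict[str, str]:
        placement: Dict[str, str] = {}
        for task in tasks:
            candidates = [m for m in self._managers if m.accepts == task.kind]
            if not candidates:
                raise ValueError(f"no manager accepts {task.kind} task {task.name!r}")
            # Assumed policy: pick the compatible manager with the fewest tasks;
            # ties go to the manager listed first.
            target = min(candidates, key=lambda m: len(m.submitted))
            target.submit(task)
            placement[task.name] = target.provider
        return placement


managers = [
    Manager("cloud_a", accepts="container"),
    Manager("cloud_b", accepts="container"),
    Manager("hpc_x", accepts="executable"),
]
tasks = [
    Task("t1", image="example/image"),
    Task("t2", image="example/image"),
    Task("t3", executable="/bin/date"),
]
placement = ServiceProxy(managers).submit(tasks)
print(placement)
```

In this sketch the two container tasks are spread across the two cloud managers and the executable goes to the HPC manager. A real broker would also need to acquire the resources, stage data, and track task state, none of which is modeled here.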
Abstract
Scientific discovery increasingly depends on middleware that enables the execution of heterogeneous workflows on heterogeneous platforms. One of the main challenges is to design software components that integrate within the existing ecosystem to enable scale and performance across cloud and high-performance computing (HPC) platforms. Researchers are met with a varied computing landscape, which includes services available on commercial cloud platforms, data and network capabilities specifically designed for scientific discovery on government-sponsored cloud platforms, and scale and performance on HPC platforms. We present Hydra, an intra/cross-cloud HPC brokering system capable of concurrently acquiring resources from commercial/private cloud and HPC platforms, and managing the execution of heterogeneous workflow applications on those resources. This paper offers four main contributions: (1) the design of brokering capabilities in the presence of task, platform, resource, and middleware heterogeneity; (2) a reference implementation of that design with Hydra; (3) an experimental characterization of Hydra's overheads and strong/weak scaling with heterogeneous workloads and platforms; and (4) the implementation of a workflow that models sea rise with Hydra, and its scaling on cloud and HPC platforms.
