A Deep Reinforcement Learning Approach for Cost Optimized Workflow Scheduling in Cloud Computing Environments
Amanda Jayanetti, Saman Halgamuge, Rajkumar Buyya
TL;DR
This work tackles cost-optimized workflow scheduling in cloud environments where spot instances offer discounts but introduce interruptions and price variability. It proposes a Deep Reinforcement Learning framework with a hierarchical action space and multiple actor networks guided by a common critic, trained with Proximal Policy Optimization (PPO). The objective is to minimize the cost $MC = \sum_{t_j\in T} CC(t_j)$, where $CC(t_j) = CT(t_j) \cdot UC$ and $CT(t_j) = \frac{L(t_j)}{F}$. The framework is integrated end-to-end into the open-source Argo workflow engine running on Kubernetes, where it schedules DAG-based workflows by first selecting a node group and then a specific node. Results show substantial cost savings compared with the Random, Kubernetes default, and On-Demand policies, at the price of longer execution times and higher spot-interruption rates, which illustrates a practical cost-utility trade-off for multi-cloud, container-native workflows. The approach is presented as a scalable, deployable solution for cost-aware workflow management in real-world cloud environments.
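The cost model in the summary above can be sketched in a few lines of Python. The summary does not define its symbols, so this sketch assumes that $L(t_j)$ is the length of task $t_j$ in work units, $F$ is the processing capacity of the node the task runs on, and $UC$ is that node's price per unit time. The class names, node names, and all numeric values are hypothetical and chosen only for illustration; interruptions and price changes are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    length: float  # L(t_j): assumed to be the task length in work units


@dataclass(frozen=True)
class Node:
    name: str
    capacity: float   # F: assumed work units processed per second
    unit_cost: float  # UC: assumed price per second of use


def computation_time(task: Task, node: Node) -> float:
    """CT(t_j) = L(t_j) / F"""
    return task.length / node.capacity


def computation_cost(task: Task, node: Node) -> float:
    """CC(t_j) = CT(t_j) * UC"""
    return computation_time(task, node) * node.unit_cost


def monetary_cost(assignment: list[tuple[Task, Node]]) -> float:
    """MC = sum of CC(t_j) over every task in the workflow."""
    return sum(computation_cost(task, node) for task, node in assignment)


# Hypothetical nodes: same capacity, spot priced at half the on-demand rate.
on_demand = Node("on-demand", capacity=100.0, unit_cost=0.5)
spot = Node("spot", capacity=100.0, unit_cost=0.25)

t1 = Task("t1", length=200.0)
t2 = Task("t2", length=400.0)

all_on_demand = monetary_cost([(t1, on_demand), (t2, on_demand)])
mixed = monetary_cost([(t1, on_demand), (t2, spot)])
print(all_on_demand, mixed)
```

With these made-up numbers, moving the longer task to the spot node lowers the total cost. The formula alone does not capture the cost of re-running a task after a spot interruption, which is the uncertainty the learned scheduling policy has to handle.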
Abstract
Cost optimization is a common goal of workflow schedulers operating in cloud computing environments. Spot instances are a potential means of achieving this goal: cloud providers offer them at discounted prices compared to their on-demand counterparts in exchange for reduced reliability. The reduced reliability arises because spot instances are subject to interruption when the spare computing capacity used to provision them is reclaimed in response to demand variations. In addition, spot prices are not fixed, since pricing depends on long-term supply and demand. The possibility of interruptions and price variations adds a layer of uncertainty to the general problem of workflow scheduling across cloud computing environments. These challenges must be addressed efficiently to realize the cost savings that spot instances offer without compromising the underlying business requirements. To this end, in this paper we use Deep Reinforcement Learning to develop an autonomous agent capable of scheduling workflows cost-efficiently by using an intelligent mix of spot and on-demand instances. The proposed solution is implemented in the open-source, container-native Argo workflow engine, which is widely used for executing industrial workflows. The experimental results demonstrate that the proposed scheduling method can outperform the current benchmarks.
