A Deep Reinforcement Learning Approach for Cost Optimized Workflow Scheduling in Cloud Computing Environments

Amanda Jayanetti; Saman Halgamuge; Rajkumar Buyya

TL;DR

This work tackles cost-optimized workflow scheduling in cloud environments where spot instances offer discounts but introduce interruptions and price variability. It proposes a Deep Reinforcement Learning framework with a hierarchical action space and multi-actor networks guided by a common critic, trained with Proximal Policy Optimization (PPO) to minimize $MC = \sum_{t_j\in T} CC(t_j)$, where $CC(t_j) = CT(t_j) \cdot UC$ and $CT(t_j) = \frac{L(t_j)}{F}$. The framework is integrated end-to-end into the open-source Argo workflow engine running on Kubernetes to schedule DAG-based workflows by selecting node groups and specific nodes. Results show substantial cost savings compared with Random, Kubernetes default, and On-Demand policies, albeit with increased execution times and higher spot-interruption rates, illustrating a practical cost-utility trade-off for multi-cloud, container-native workflows. The approach demonstrates a scalable, deployable solution for cost-aware workflow management in real-world cloud environments.
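The cost objective quoted above can be illustrated with a short sketch. It assumes that $L(t_j)$ is the length of task $t_j$ in abstract work units, that $F$ is the processing capacity of the node running it, and that $UC$ is that node's price per unit of time. These readings, the `Task` class, the function names, and all numbers are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; `length` plays the role of L(t_j)."""
    name: str
    length: float  # assumed: work units required by the task


def completion_time(task: Task, capacity: float) -> float:
    """CT(t_j) = L(t_j) / F, with F assumed to be work units per time unit."""
    return task.length / capacity


def computation_cost(task: Task, capacity: float, unit_cost: float) -> float:
    """CC(t_j) = CT(t_j) * UC, with UC assumed to be price per time unit."""
    return completion_time(task, capacity) * unit_cost


def monetary_cost(placements) -> float:
    """MC = sum of CC(t_j) over all tasks in the workflow.

    `placements` is an iterable of (task, capacity, unit_cost) triples,
    one per task, describing the node each task was scheduled on.
    """
    return sum(computation_cost(t, f, uc) for t, f, uc in placements)


# Made-up example: one task on an on-demand node, one on a cheaper spot node.
placements = [
    (Task("preprocess", 1200.0), 100.0, 0.5),   # on-demand: CT = 12, CC = 6.0
    (Task("train", 600.0), 100.0, 0.25),        # spot:      CT = 6,  CC = 1.5
]
total = monetary_cost(placements)
print(total)  # 7.5
```

In this toy setting, moving a task to a node with a lower unit cost reduces $MC$ directly. The sketch ignores interruptions and price changes, which are the sources of uncertainty the scheduler has to account for.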

Abstract

Cost optimization is a common goal of workflow schedulers operating in cloud computing environments. Spot instances are a potential means of achieving this goal, as cloud providers offer them at discounted prices compared to their on-demand counterparts in exchange for reduced reliability. The reduced reliability arises because spot instances are subject to interruption when the spare computing capacity used to provision them is reclaimed owing to demand variations. In addition, spot prices are not fixed, as pricing depends on long-term supply and demand. The possibility of interruptions and price variations adds a layer of uncertainty to the general problem of workflow scheduling across cloud computing environments. These challenges must be addressed efficiently to realize the cost savings achievable with spot instances without compromising the underlying business requirements. To this end, in this paper we use Deep Reinforcement Learning to develop an autonomous agent capable of scheduling workflows in a cost-efficient manner by using an intelligent mix of spot and on-demand instances. The proposed solution is implemented in the open-source, container-native Argo workflow engine, which is widely used for executing industrial workflows. The experimental results demonstrate that the proposed scheduling method is capable of outperforming the current benchmarks.

Paper Structure (17 sections, 15 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Related Work
Problem Formulation
Background and Proposed Approach
Kubernetes
Argo Workflow Engine
Reinforcement Learning
Proposed RL Framework
Agent Environment
Multi-Actor RL Algorithm
Performance Evaluation
Experimental testbed
Experimental dataset
DRL Scheduler Implementation
Comparison Algorithms
...and 2 more sections

Figures (4)

Figure 1: System Architecture
Figure 2: Proposed hierarchical action space and multi-actor DRL model
Figure 3: Sequence diagram of DRL based scheduling framework
Figure 4: Comparison of performance of scheduling algorithms on an experimental dataset
