DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

Yoochan Kim; Kihyun Kim; Yonghyeon Cho; Jinwoo Kim; Awais Khan; Ki-Dong Kang; Baik-Song An; Myung-Hoon Cha; Hong-Yeon Kim; Youngjae Kim

TL;DR

DeepVM tackles the barrier to affordable distributed deep learning by balancing Spot and On-Demand VMs under a formal four-stage framework. It introduces FLOPP-based instance assessment and an LP-driven architecture-level analysis (Single Anchor and Tiering) to maximize a cost-aware performance metric, while modeling overheads via scaling factors and network saturation. Evaluations in simulation and AWS show DeepVM reduces total cost and makespan compared with baselines, and its checkpointing-focused Tiering approach further enhances robustness in Spot VM environments. The method offers practical impact by democratizing access to large-scale DDL through principled, transparent cloud configurations, though portability currently hinges on AWS-specific instance data.

Abstract

Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to many users. Public cloud services, particularly Spot Virtual Machines (VMs), offer a cost-effective alternative, but their unpredictable availability poses a significant challenge to the crucial checkpointing process in DDL. To address this, we introduce DeepVM, a novel solution that recommends cost-effective cluster configurations by intelligently balancing the use of Spot and On-Demand VMs. DeepVM uses a four-stage process that analyzes instance performance using the FLOPP (FLoating-point Operations Per Price) metric, performs architecture-level analysis with linear programming, and identifies the optimal configuration for the user's specific needs. Extensive simulations and real-world deployments in the AWS environment demonstrate that DeepVM consistently outperforms other policies, reducing training costs and overall makespan. By enabling cost-effective checkpointing with Spot VMs, DeepVM opens up DDL to a wider range of users and facilitates more efficient training of complex DNNs.
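To make the instance-level idea concrete, the following is a minimal sketch of ranking candidate instances by a FLOPP-style score. It assumes FLOPP is computed as peak floating-point throughput divided by hourly price, which is one plausible reading of the name and not necessarily the paper's exact formula. The `Instance` class, the helper functions, the instance names, and all numeric values are hypothetical placeholders, not AWS data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """A candidate VM type. All field values used below are hypothetical."""
    name: str
    tflops: float          # assumed peak floating-point throughput (TFLOPS)
    price_per_hour: float  # assumed hourly price (USD)


def flopp(inst: Instance) -> float:
    """Assumed FLOPP score: throughput per unit of hourly price."""
    if inst.price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return inst.tflops / inst.price_per_hour


def rank_by_flopp(instances):
    """Return instances sorted from highest to lowest FLOPP score."""
    return sorted(instances, key=flopp, reverse=True)


# Illustrative placeholder candidates, not real instance types or prices.
candidates = [
    Instance("type-a", tflops=8.0, price_per_hour=0.5),
    Instance("type-b", tflops=16.0, price_per_hour=2.0),
    Instance("type-c", tflops=4.0, price_per_hour=0.2),
]

ranked = rank_by_flopp(candidates)
for inst in ranked:
    print(f"{inst.name}: {flopp(inst):.1f} TFLOPS per USD-hour")
```

In this toy input the fastest instance ("type-b") ranks last because its price grows faster than its throughput, which is the kind of trade-off a price-normalized metric is meant to expose. A full recommendation would also need the overhead and architecture-level analysis that the abstract describes, which this sketch does not attempt.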

Paper Structure (40 sections, 9 equations, 8 figures, 7 tables)

Introduction
Background and Related Work
Distributed Deep Learning
Preemption Hazard of Cloud Spot VM
Spot VMs and the On-Demand Dilemma in Checkpointing
Existing Approaches and Their Limitations
Design of DeepVM
Challenges
Overview
User Pricing Input
Instance-level Analysis
Architecture-level Analysis
Final Decision
Overhead Modeling
Scaling factor of multiple GPUs
...and 25 more sections

Figures (8)

Figure 1: Write throughput measurements of the in-house testbed and AWS storage. We used 16 writer threads, with each thread performing 4 KB block writes. The execution time for each experiment was set to 60 seconds.
Figure 2: An overview of DeepVM.
Figure 3: Single Anchor and Tiering Architecture.
Figure 4: Speedup results as the number of GPU-VMs increased for three different DL image models. g4dn.xlarge VMs were used.
Figure 5: Overhead analysis in modeling. Results show the estimated speedup (red) and actual speedup (blue) when training ResNet50 for 30 epochs as the number of GPU-VMs increases, for three different GPU-VM types.
...and 3 more figures
