CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno

TL;DR

CRIUgpu tackles the problem of efficiently and transparently checkpointing GPU-accelerated workloads in containerized, multi-tenant environments. It moves beyond device API interception by leveraging GPU driver checkpointing through CUDA and ROCm plugins tightly integrated with CRIU, yielding unified CPU-GPU snapshots. The approach eliminates steady-state overheads associated with interception, scales near linearly with the number of GPUs, and demonstrates fast recovery across diverse DL and HPC workloads, including large language models. The work has practical impact by enabling robust fault tolerance and rapid recovery for GPU-heavy workloads in production, with documented support for both NVIDIA and AMD devices and integration with common container runtimes. It also opens avenues for further optimizations such as compression, asynchronous data transfer, and broader hardware support.

Abstract

Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant environments where compute resources are shared and job preemptions or interruptions are common. However, transparent and unified GPU snapshots are particularly challenging because of the hardware architecture differences between CPU and GPU, including memory subsystems, dynamic parallelism, and thread synchronization. State-of-the-art GPU checkpointing techniques typically leverage mechanisms that intercept, log, and replay device API calls. However, this approach adds performance overhead and requires hardware-specific implementations that are difficult to test, maintain, and integrate with existing container platforms. In this paper, we present CRIUgpu, a novel approach for transparent checkpointing of GPU-accelerated workloads that builds on recently introduced driver capabilities, enabling support for CUDA and ROCm applications. Our evaluation results show that CRIUgpu works with a variety of deep learning and high-performance computing workloads running across multiple GPUs, completely eliminating steady-state performance overheads, and significantly reducing recovery times compared to state-of-the-art transparent GPU checkpointing mechanisms.
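As a rough illustration of the intercept, log, and replay pattern that the abstract contrasts CRIUgpu with, the toy Python sketch below wraps a simulated device so that every call is recorded and can later be replayed to rebuild state. All names here (`ToyDevice`, `InterceptingProxy`, `replay`) are hypothetical and exist only for this example; the sketch does not reproduce Cricket, CRIUgpu, or any real GPU API. It is meant only to show why such a log grows with the number of calls and why restore cost depends on replaying it.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU: named buffers holding lists of floats."""

    def __init__(self):
        self.buffers = {}

    def alloc(self, name, size):
        self.buffers[name] = [0.0] * size

    def write(self, name, values):
        self.buffers[name] = list(values)

    def scale(self, name, factor):
        self.buffers[name] = [v * factor for v in self.buffers[name]]


class InterceptingProxy:
    """Forwards each call to the wrapped device and appends it to a log."""

    def __init__(self, device):
        self._device = device
        self.log = []

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself,
        # i.e. the device's API methods.
        method = getattr(self._device, name)

        def wrapper(*args):
            self.log.append((name, args))  # per-call logging cost
            return method(*args)

        return wrapper


def replay(log):
    """Rebuild device state on a fresh device by re-issuing every logged call."""
    device = ToyDevice()
    for name, args in log:
        getattr(device, name)(*args)
    return device


proxy = InterceptingProxy(ToyDevice())
proxy.alloc("w", 3)
proxy.write("w", [1.0, 2.0, 3.0])
for _ in range(4):  # stand-in for repeated training steps
    proxy.scale("w", 2.0)

restored = replay(proxy.log)
print(len(proxy.log), restored.buffers["w"])
```

In this toy model the log gains one entry per intercepted call, so both the bookkeeping during normal execution and the work needed at restore time grow with the length of the run. A driver-level snapshot, as described in the abstract, avoids keeping such a log at all.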

Paper Structure

This paper contains 25 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: A comparison between (a) CRIUgpu and (b) a state-of-the-art checkpointing system that uses a mechanism for interception, logging, and replay of device API calls.
  • Figure 2: Analysis of intercepted CUDA API calls and memory transfers between host and device during neural network training reveals significant overhead that increases with the number of epochs. Setup: A PyTorch implementation of stochastic gradient descent neural network training with one input layer (10 features), one hidden layer (50 units), and one output layer (1 unit) running with (Cricket) and without (Baseline) API interception.
  • Figure 3: An overview of the transparent checkpoint/restore mechanisms with CUDA and AMD GPU plugins for CRIU.
  • Figure 4: Sequence diagrams of CRIU interactions with NVIDIA and AMD drivers.
  • Figure 5: In-memory GPU checkpoint/restore with H100. Similar results are observed with A100.
  • ...and 2 more figures