Optimizing Frequent Checkpointing via Low-Cost Differential for Distributed Training Systems

Chenxuan Yao, Yuchong Hu, Feifan Liu, Zhengyu Liu, Lin Wang, Mingqi Li, Dan Feng

TL;DR

This work tackles the high cost of frequent differential checkpointing in distributed deep learning. It introduces LowDiff, which reuses the compressed gradients already produced during training as compressed differentials, eliminating separate compression overhead and reducing transmission costs. LowDiff also batches gradient writes and configures the checkpointing frequency and batch size dynamically to minimize wasted time. For non-compression scenarios, LowDiff+ employs layer-wise gradient reuse and CPU-based asynchronous persistence, together with a parallel recovery design, to enable near per-iteration checkpointing with minimal interference. Across diverse models and hardware, the approach supports checkpointing as often as every iteration with minimal overhead and delivers substantial speedups in training time and recovery, which makes it practical for fault-tolerant, high-frequency checkpointing in large-scale training systems.

Abstract

Distributed training of large deep-learning models frequently encounters failures, so checkpointing is commonly employed for recovery. State-of-the-art studies focus on frequent checkpointing for fast recovery from failures. However, frequent checkpointing generates numerous checkpoints, incurring substantial costs and thus degrading training performance. Recently, differential checkpointing has been proposed to reduce costs, but it is limited to recommendation systems, so its application to general distributed training systems remains unexplored. We propose LowDiff, an efficient frequent checkpointing framework that reuses compressed gradients as differential checkpoints to reduce cost. Furthermore, LowDiff incorporates a batched gradient write optimization to persist these differentials to storage efficiently. It also dynamically tunes both the checkpointing frequency and the batching size to maximize performance. For the non-compression scenario, we further propose LowDiff+, which combines a layer-wise gradient reusing and snapshotting approach with a CPU-based asynchronous persistence strategy, enabling frequent checkpointing without gradient compression. Experiments on various workloads show that LowDiff can achieve a checkpointing frequency of up to once per iteration with less than 3.1% runtime overhead.
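
The core idea of reusing gradients as differential checkpoints, combined with batched writes, can be illustrated with a minimal sketch. It assumes plain SGD and top-k sparsification as the gradient compressor, and it uses an in-memory list in place of persistent storage. All names (`top_k_sparsify`, `DiffCheckpointer`, `demo`) are hypothetical and are not taken from the paper, whose actual design may differ.

```python
from typing import Dict, List, Tuple

# A compressed gradient: parameter index -> gradient value.
SparseGrad = Dict[int, float]


def top_k_sparsify(grad: List[float], k: int) -> SparseGrad:
    """Keep the k largest-magnitude entries (assumed compressor)."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


def sgd_step(params: List[float], grad: SparseGrad, lr: float) -> List[float]:
    """Apply one plain SGD update using a sparse gradient."""
    updated = list(params)
    for i, g in grad.items():
        updated[i] -= lr * g
    return updated


class DiffCheckpointer:
    """Stores one full checkpoint plus batched gradient differentials."""

    def __init__(self, base_params: List[float], batch_size: int) -> None:
        self.base = list(base_params)            # full checkpoint
        self.batch_size = batch_size             # differentials per write
        self.pending: List[SparseGrad] = []      # buffered, not yet written
        self.storage: List[List[SparseGrad]] = []  # one entry per write

    def record(self, grad: SparseGrad) -> None:
        """Reuse the already-compressed gradient as the differential."""
        self.pending.append(grad)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Persist all buffered differentials in a single write."""
        if self.pending:
            self.storage.append(self.pending)
            self.pending = []

    def recover(self, lr: float) -> List[float]:
        """Replay stored differentials on top of the full checkpoint."""
        params = list(self.base)
        for batch in self.storage:
            for grad in batch:
                params = sgd_step(params, grad, lr)
        return params


def demo() -> Tuple[List[float], List[float], int]:
    lr = 0.1
    params = [1.0, -2.0, 0.5, 3.0]
    ckpt = DiffCheckpointer(params, batch_size=2)
    grads = [
        [0.4, -0.1, 0.0, 2.0],
        [1.5, 0.2, -0.3, 0.1],
        [0.0, -1.0, 0.7, 0.05],
        [0.3, 0.3, -2.0, 0.1],
    ]
    for dense in grads:
        sparse = top_k_sparsify(dense, k=2)   # produced for communication
        params = sgd_step(params, sparse, lr)  # normal training update
        ckpt.record(sparse)                    # same object doubles as the differential
    ckpt.flush()
    return params, ckpt.recover(lr), len(ckpt.storage)
```

In this sketch every iteration is checkpointed, yet no extra compression is performed and only one write is issued per batch of differentials. With a stateful optimizer such as Adam, recovery would also need the optimizer state, which the sketch omits.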

Paper Structure

This paper contains 28 sections, 6 equations, 19 figures, 3 tables, 2 algorithms.

Figures (19)

  • Figure 1: Impacts of DC computation and transmission frequency (in iterations) on training performance of GPT2-L.
  • Figure 2: Motivating example of LowDiff.
  • Figure 3: LowDiff's DC can run in parallel with forward pass (F), backward pass (B), and model update (U).
  • Figure 4: Time of iteration, full checkpointing, and differential checkpointing.
  • Figure 5: Architecture of LowDiff.
  • ...and 14 more figures