Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training

Ray Cao, Sherry Luo, Steve Gan, Sujeeth Jinesh

TL;DR

This work addresses failures in large-scale distributed machine learning by relaxing data consistency in data-parallel training. It evaluates three failure-recovery strategies (checkpointing, chain replication, and a novel stateless parameter server), implemented in Python with Ray and Zookeeper and tested by deliberately terminating the parameter server. The stateless parameter server keeps converging under failure and can even improve accuracy despite relying on stale weights and gradients, while chain replication and checkpointing converge but lose accuracy because they restart from old states. The results suggest that letting workers produce updates during downtime improves hardware utilization at a cloud cost comparable to traditional methods, and they point future work toward hybrid approaches and broader model and accelerator support.

Abstract

In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure using various parameter server configurations. Our failure recovery strategies include traditional checkpointing, chain replication (which ensures a backup server takes over in case of failure), and a novel stateless parameter server approach. In the stateless approach, workers continue generating gradient updates even if the parameter server is down and apply these updates once the server is back online. We compare chain replication and the stateless approach against the standard checkpointing baseline, in which the training job is resumed from the latest checkpoint. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training in every experiment. Our experimental results indicate that the stateless parameter server approach continues to train towards convergence and improves accuracy by as much as 10% in the face of a failure, despite using stale weights and gradients. The chain replication and checkpointing techniques also converge but suffer setbacks in accuracy because they restart from old checkpoints. These results suggest that allowing workers to continue generating updates during server downtime and applying these updates later can effectively improve hardware utilization. Furthermore, despite its higher resource usage, the stateless parameter server method incurs hardware costs similar to those of standard checkpointing because of the pricing structure of common cloud providers.
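The difference between checkpoint rollback and the stateless approach can be illustrated with a toy simulation. The sketch below is not the authors' Ray and Zookeeper implementation. It is plain Python minimizing a one-dimensional quadratic, and every name in it (`ParameterServer`, `Worker`, `run_stateless`, `run_checkpoint`) is hypothetical. It also assumes that a restarted server is re-seeded from the weight a worker last pulled, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ParameterServer:
    """Toy parameter server holding one scalar weight (hypothetical)."""
    weight: float = 0.0
    lr: float = 0.1
    alive: bool = True

    def apply(self, grad: float) -> None:
        if not self.alive:
            raise ConnectionError("parameter server is down")
        self.weight -= self.lr * grad


@dataclass
class Worker:
    """Worker that caches the last pulled weight and buffers undelivered gradients."""
    target: float
    cached_weight: float = 0.0
    pending: List[float] = field(default_factory=list)

    def step(self, server: ParameterServer) -> None:
        # Gradient of 0.5 * (w - target)^2 at the cached, possibly stale, weight.
        self.pending.append(self.cached_weight - self.target)
        if not server.alive:
            return  # keep computing during downtime; deliver later
        for grad in self.pending:  # flush buffered gradients, stale ones included
            server.apply(grad)
        self.pending.clear()
        self.cached_weight = server.weight  # pull fresh weights


def run_stateless(steps: int, down: Set[int], target: float = 1.0, lr: float = 0.1) -> float:
    """Workers keep producing gradients while the server is down."""
    server = ParameterServer(lr=lr)
    worker = Worker(target=target)
    for t in range(steps):
        if t in down:
            server.alive = False
        elif not server.alive:
            # Assumption: the replacement server is re-seeded from the worker's cache.
            server = ParameterServer(weight=worker.cached_weight, lr=lr)
        worker.step(server)
    return server.weight


def run_checkpoint(steps: int, down: Set[int], every: int = 4,
                   target: float = 1.0, lr: float = 0.1) -> float:
    """Baseline: workers idle during downtime, then training resumes from the last checkpoint."""
    weight = checkpoint = 0.0
    was_down = False
    for t in range(steps):
        if t in down:
            was_down = True
            continue
        if was_down:
            weight = checkpoint  # roll back, losing progress made since the checkpoint
            was_down = False
        weight -= lr * (weight - target)
        if (t + 1) % every == 0:
            checkpoint = weight
    return weight


if __name__ == "__main__":
    outage = {5, 6, 7}  # steps during which the server is unreachable
    print("stateless :", run_stateless(12, outage))
    print("checkpoint:", run_checkpoint(12, outage))
```

In this toy run the stateless variant finishes closer to the target than the checkpoint baseline, because the gradients buffered during the outage are applied on recovery instead of being discarded along with the progress made since the last checkpoint. This only illustrates the mechanism on a convex problem; it says nothing about the accuracy figures reported in the paper.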

Paper Structure (14 sections, 8 figures)

Figures (8)

  • Figure 1: System Overview
  • Figure 2: Failures Overview
  • Figure 3: Pseudo-code describing the Stateless Parameter Server experiment
  • Figure 4: Training accuracy after killing and recovering once. Legend: Blue - Sync checkpointing, Orange - Async checkpointing, Green - Sync chain replication, Red - Async chain replication, Purple - Stateless parameter server
  • Figure 5: Training accuracy after killing and recovering twice. Legend: Blue - Sync checkpointing, Orange - Async checkpointing, Green - Sync chain replication, Red - Async chain replication, Purple - Stateless parameter server
  • ...and 3 more figures