Straggler Tolerant and Resilient DL Training on Homogeneous GPUs

Zeyu Zhang; Haiying Shen

TL;DR

This work investigates stragglers in DL training on homogeneous GPUs, showing that CPU and bandwidth usage imbalances make stragglers widespread and that existing mitigations, which switch from synchronous SGD (SSGD) to asynchronous SGD (ASGD), do not reliably improve time-to-accuracy (TTA). It introduces STAR, a Straggler Tolerant And Resilient training system that combines straggler prediction, static and dynamic x-order synchronization modes, heuristic and ML-based mode selection, and resource-aware straggler prevention to minimize TTA in both the parameter server (PS) and all-reduce (AR) architectures. In trace-driven experiments on AWS, STAR lowers TTA by about 48-84% in PS and 51-70% in AR relative to state-of-the-art systems while maintaining the converged accuracy of SSGD, and it also improves job completion time and reduces the number of stragglers. The STAR code is open-sourced, and the paper analyzes the trade-offs among synchronization modes, resource consumption, and proactive prevention strategies, offering practical guidance for deploying straggler-resilient DL training on homogeneous GPU clusters.

Abstract

Despite the popularity of homogeneous GPU-based deep learning (DL) training, the prevalence, causes, and impact of stragglers, as well as the effectiveness of existing straggler mitigation approaches, are still not well understood in this setting. To fill this gap, we conducted comprehensive experiments and found that stragglers remain widespread due to CPU and bandwidth usage imbalances. Additionally, existing mitigation methods that switch from synchronous stochastic gradient descent (SSGD) to asynchronous SGD (ASGD) may not improve Time-To-Accuracy (TTA) and can even generate more stragglers because ASGD consumes more resources. To address these newly found problems, we propose the Straggler Tolerant And Resilient DL training system (STAR). STAR includes new synchronization modes that group workers for each parameter update. It uses a heuristic method and an ML method to choose the optimal synchronization mode for minimizing TTA, and it reallocates resources to support the selected mode while minimizing the impact on co-located jobs. Moreover, it proactively prevents stragglers by avoiding overloading CPU and bandwidth resources when allocating parameter servers (PSs), which consume high CPU and bandwidth, and during gradient transmission. Our trace-driven evaluation on AWS shows that STAR achieves 48-84% and 51-70% lower TTA than state-of-the-art systems in the PS and all-reduce architectures, respectively, while maintaining the converged accuracy of SSGD. The code for STAR is open-sourced.
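
The effect of grouping workers on per-update latency can be illustrated with a toy calculation. The sketch below is not STAR's implementation; it assumes, purely for illustration, a mode in which a parameter update proceeds as soon as the x fastest workers have reported their gradients. The function names and the sample timings are hypothetical, and the sketch models only waiting time, not the accuracy or resource effects that the abstract says determine TTA.

```python
from typing import List


def ssgd_step_time(worker_times: List[float]) -> float:
    """Fully synchronous step: the update waits for the slowest worker."""
    return max(worker_times)


def grouped_step_time(worker_times: List[float], x: int) -> float:
    """Hypothetical grouped step (an assumption, not STAR's definition):
    the update proceeds once the x fastest workers have reported."""
    if not 1 <= x <= len(worker_times):
        raise ValueError("x must be between 1 and the number of workers")
    # The x-th smallest time is when the x-th gradient arrives.
    return sorted(worker_times)[x - 1]


# Made-up per-worker iteration times in seconds; the last worker is a straggler.
times = [1.0, 1.1, 0.9, 4.0]

print(ssgd_step_time(times))        # waits for the straggler
print(grouped_step_time(times, 3))  # proceeds without the straggler
```

With x equal to the number of workers, the grouped step degenerates to the synchronous case, which is why choosing the mode per job matters: a smaller x shortens each update but uses fewer gradients per update.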
