A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning

Yuesheng Xu; Arielle Carr

TL;DR

This paper proposes a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure, enhancing the performance and efficiency of the overall training process.

Abstract

The increasing complexity of deep learning models and the demand for processing vast amounts of data make large-scale distributed systems essential for efficient training. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates optimization techniques for distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy to mitigate the problem of straggler nodes caused by failure, improving the performance and efficiency of the overall training process. Experiments with different numbers of workers and communication periods demonstrate improved convergence rates and test performance under our strategy.

Paper Structure (15 sections, 17 equations, 5 figures)

Introduction
Problem Setting
Parallelism in Distributed Deep Learning
Model Parallelism
Data Parallelism
Synchronous and Asynchronous Methods
Related Work
Elastic Averaging Stochastic Gradient Descent
Second-Order Methods
Method
Data Overlap
Dynamic Weight
Experiment Settings
Results
Conclusions and Future Work

Figures (5)

Figure 1: This diagram illustrates the concept of model parallelism. Each worker holds a different segment of the model, allowing parallel computation and the handling of very large models that cannot fit on a single device.
Figure 2: This diagram illustrates the concept of data parallelism. Each worker node holds a subset of the dataset and communicates with the master node to update both its own model and the aggregated model.
Figure 3: Data overlap ratios: $\{0\%, 12.5\%, 25\%, 37.5\%, 50\%\}$
Figure 4: Test accuracy over communication rounds for $k \in \{4, 8\}$ workers and communication period $\tau \in \{1, 2, 4\}$. Each experiment is averaged over 3 runs.
Figure 5: Training loss over communication rounds for $k \in \{4, 8\}$ workers and communication period $\tau \in \{1, 2, 4\}$. Each experiment is averaged over 3 runs.
