Leveraging partial stragglers within gradient coding

Aditya Ramamoorthy, Ruoyu Meng, Vrinda S. Girimaji

TL;DR

This work presents novel gradient coding protocols that judiciously leverage the work performed by partial stragglers, together with efficient algorithms for optimizing the relative ordering of chunks within the workers, an ordering that affects the overall execution time.

Abstract

Within distributed learning, workers typically compute gradients on their assigned dataset chunks and send them to the parameter server (PS), which aggregates them to compute either an exact or approximate version of $\nabla L$ (the gradient of the loss function $L$). However, in large-scale clusters, many workers run slower than their promised speed or are even failure-prone. A gradient coding solution introduces redundancy within the assignment of chunks to the workers and uses coding-theoretic ideas to allow the PS to recover $\nabla L$ (exactly or approximately), even in the presence of stragglers. Unfortunately, most existing gradient coding protocols are inefficient from a computation perspective, as they coarsely classify workers as operational or failed; the potentially valuable work performed by slow workers (partial stragglers) is ignored. In this work, we present novel gradient coding protocols that judiciously leverage the work performed by partial stragglers. Our protocols are efficient from a computation and communication perspective and are numerically stable. For an important class of chunk assignments, we present efficient algorithms for optimizing the relative ordering of chunks within the workers; this ordering affects the overall execution time. For exact gradient reconstruction, our protocol is around $2\times$ faster than the original class of protocols, and for approximate gradient reconstruction, the mean-squared error (MSE) of our reconstructed gradient is several orders of magnitude better.
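To illustrate why ignoring partial stragglers wastes work, here is a minimal Python sketch. It is not the paper's protocol. It assumes a cyclic assignment with $N=m=3$ in which each worker holds two chunks and processes them in order, that $W_1$ is slow (one chunk done), $W_2$ is finished, and $W_3$ has failed, that each worker reports the plain sum of the chunk gradients it has finished, and that the PS combines reports by least squares. The names `assignment`, `done`, `report_matrix`, and `reconstruct` are hypothetical.

```python
import numpy as np

# Assumed cyclic assignment: chunk indices per worker, in processing order.
assignment = [[0, 1], [1, 2], [2, 0]]
# Assumed progress: W1 finished one chunk, W2 both, W3 none (failed).
done = [1, 2, 0]


def report_matrix(assignment, done, use_partial, m):
    """Rows indicate which chunks are summed in each usable worker report."""
    rows = []
    for chunks, k in zip(assignment, done):
        # Coarse model: a worker counts only if it finished every chunk.
        if k == 0 or (not use_partial and k < len(chunks)):
            continue
        row = np.zeros(m)
        row[chunks[:k]] = 1.0
        rows.append(row)
    return np.array(rows).reshape(-1, m)


def reconstruct(assignment, done, chunk_grads, use_partial):
    """Least-squares estimate of the sum of all chunk gradients."""
    m = chunk_grads.shape[0]
    B = report_matrix(assignment, done, use_partial, m)
    reports = B @ chunk_grads  # what each usable worker sends to the PS
    # Find weights w such that w^T B is as close as possible to all-ones.
    weights, *_ = np.linalg.lstsq(B.T, np.ones(m), rcond=None)
    return weights @ reports


rng = np.random.default_rng(0)
chunk_grads = rng.normal(size=(3, 4))  # one toy gradient vector per chunk
true_grad = chunk_grads.sum(axis=0)

coarse = reconstruct(assignment, done, chunk_grads, use_partial=False)
partial = reconstruct(assignment, done, chunk_grads, use_partial=True)

print("coarse error :", np.linalg.norm(coarse - true_grad))
print("partial error:", np.linalg.norm(partial - true_grad))
```

Under these assumptions, the coarse model sees only $W_2$'s report and misses chunk 1 entirely, whereas counting $W_1$'s single finished chunk covers all three chunks and the sum is recovered exactly.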

Paper Structure (10 sections, 14 equations, 7 figures, 2 algorithms)


Figures (7)

  • Figure 1: Green/red means that the worker did/didn't process a chunk. (a) System with $N=m=3$. Each worker is assigned two chunks that it processes in top-to-bottom order. $W_3$ has failed and $W_1$ is slow. (b) An arbitrary assignment of chunks to the workers (example also appears in tayyebehM21).
  • Figure 2: Two different relative orderings of the chunks within workers for the same assignment matrix. Individual figures show the calculation of $Q_5$. Similarly, other $Q_i$ values can be computed. $Q_{\max} = Q_5$ for both assignments. Thus, (a) $Q_{\max} = 10$. (b) $Q_{\max} = 9$.
  • Figure 3: (a) Mean-squared error (MSE) vs. $T$ for an approximate GC scenario. Blue curves: proposed protocol with $\ell=1,2,3$, purple curves: corresponding MSE estimates, and red curve: original GC protocol with $\ell=1$. Error bars correspond to one standard deviation. (b) Completion time vs. $\ell$ for exact GC scenario with two different assignment matrices. Blue curves: proposed protocol, green curves: original GC protocol. Error bars correspond to one standard deviation.
  • Figure 4: Error in Lagrange interpolation vs. the number of decimal places (precision) in the evaluation values. The three curves correspond to polynomials of degree 20, 25 and 30 (average of 100 trials).
  • Figure 5: Mean-squared error (MSE) vs. $T$ for an approximate GC scenario corresponding to an assignment matrix stemming from graph $G_2$. Error bars correspond to one standard deviation. Blue curves: proposed protocol with $\ell=1,2,3$, purple curves: corresponding MSE estimates, and red curve: original GC protocol with $\ell=1$.
  • ...and 2 more figures
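
The Figure 4 caption describes an experiment on how the precision of the evaluation values affects Lagrange interpolation. The following Python sketch shows one way such an experiment could be set up; it is not the authors' code. The equispaced nodes on $[0,1]$, the coefficient distribution, the test point, and the names `lagrange_eval`, `interpolation_error`, and `mean_error` are all assumptions. Exact rational arithmetic is used so that the only source of error is the rounding of the evaluation values.

```python
from fractions import Fraction
import random


def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolant through (xs, ys) at x, exactly."""
    total = Fraction(0)
    for i, xi in enumerate(xs):
        term = ys[i]
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def interpolation_error(degree, decimals, rng):
    """Error at one test point when evaluation values are rounded."""
    # Assumed setup: random coefficients with three decimal places.
    coeffs = [Fraction(rng.randint(-1000, 1000), 1000)
              for _ in range(degree + 1)]

    def poly(x):
        return sum(c * x**k for k, c in enumerate(coeffs))

    xs = [Fraction(i, degree) for i in range(degree + 1)]
    scale = 10**decimals
    # Keep only `decimals` decimal places of each evaluation value.
    ys = [Fraction(round(poly(x) * scale), scale) for x in xs]
    x_test = Fraction(1, 2 * degree)  # midpoint of the first two nodes
    return abs(lagrange_eval(xs, ys, x_test) - poly(x_test))


def mean_error(degree, decimals, trials=5, seed=0):
    """Average the interpolation error over several random polynomials."""
    rng = random.Random(seed)
    errors = [interpolation_error(degree, decimals, rng)
              for _ in range(trials)]
    return float(sum(errors) / trials)


for decimals in (2, 6, 10):
    print(decimals, mean_error(20, decimals))
```

Because the interpolant is linear in the evaluation values, the rounding errors are amplified by the Lagrange basis polynomials at the test point, which is why fewer decimal places lead to a larger error in this toy setup.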

Theorems & Definitions (6)

  • Remark 1
  • Remark 2
  • Remark 3
  • Claim 1
  • Proof
  • Remark 4