Exploiting Stragglers in Distributed Computing Systems with Task Grouping

Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper

TL;DR

Rather than discarding the work completed by stragglers, the proposed method exploits it by increasing the granularity of the assigned work and the frequency of worker updates.

Abstract

We consider the problem of stragglers in distributed computing systems. Stragglers, which are compute nodes that unpredictably slow down, often increase the completion times of tasks. One common approach to mitigating stragglers is work replication, where only the first completion among replicated tasks is accepted, discarding the others. However, discarding work leads to resource wastage. In this paper, we propose a method for exploiting the work completed by stragglers rather than discarding it. The idea is to increase the granularity of the assigned work, and to increase the frequency of worker updates. We show that the proposed method reduces the completion time of tasks via experiments performed on a simulated cluster as well as on Amazon EC2 with Apache Hadoop.

Paper Structure

This paper contains 27 sections, 1 equation, 15 figures, 2 algorithms.

Figures (15)

  • Figure 1: Visualizing task processing with $8$ tasks and $2$ workers. The two workers are processing tasks 4 and 5. If the 'unassigned' stack is empty but the 'assigned' stack is not, a worker will replicate the processing of a task in the 'assigned' stack.
  • Figure 2: Completion time vs. group size parameter $G$. Plots are obtained by averaging 20 repetitions (shaded areas indicate standard deviations). Note that $G=1$ corresponds to standard replication. Setting $G>1$ yields gains on the order of $30\%$ to $40\%$. In each plot the arrows show the percentage improvement at $G=10$ compared to $G=1$. Making $G$ too large slightly reduces the improvement.
  • Figure 3: An example that motivates the proposed algorithm.
  • Figure 4: Visualizing task assignments in standard replication (top), proposed method (middle), and replication-with-grouping-only (bottom). For the last two, group size $G=3$.
  • Figure 5: The PDFs for $X$, the time required to process a single task. For each PDF the dashed vertical line indicates the expected completion time.
  • ...and 10 more figures
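
The standard replication baseline described in the Figure 1 caption (the $G=1$ case) can be illustrated with a small event-driven simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `simulate_replication`, the `sample_time` callback, the uniformly random choice of which in-progress task to replicate, and the choice not to cancel a replica once another copy finishes are all hypothetical simplifications introduced here.

```python
import heapq
import random


def simulate_replication(num_tasks, num_workers, sample_time, seed=0):
    """Return the time at which every task has completed at least once.

    sample_time(rng) is a caller-supplied (hypothetical) function that
    draws the processing time of one task attempt on one worker.
    """
    rng = random.Random(seed)
    unassigned = list(range(num_tasks))[::-1]  # stack: pop() yields task 0 first
    assigned = set()   # tasks in progress that have not completed yet
    done = set()       # tasks with at least one accepted completion
    events = []        # min-heap of (finish_time, worker, task)

    def assign(worker, now):
        if unassigned:
            task = unassigned.pop()
            assigned.add(task)
        elif assigned:
            # Assumption: replicate a uniformly random in-progress task.
            task = rng.choice(sorted(assigned))
        else:
            return  # nothing left to do; the worker stays idle
        heapq.heappush(events, (now + sample_time(rng), worker, task))

    for worker in range(num_workers):
        assign(worker, 0.0)

    now = 0.0
    while len(done) < num_tasks:
        now, worker, task = heapq.heappop(events)
        if task not in done:
            # Only the first completion is accepted.
            done.add(task)
            assigned.discard(task)
        # Later completions of the same task are discarded work
        # (assumption: running replicas are not cancelled early).
        assign(worker, now)
    return now


if __name__ == "__main__":
    # Exponential task times with mean 1, chosen purely for illustration.
    t = simulate_replication(8, 2, lambda rng: rng.expovariate(1.0))
    print(f"completion time: {t:.3f}")
```

With $8$ tasks, $2$ workers, and a constant task time, the simulation reduces to four rounds of two tasks each, which makes it easy to check by hand. Any completion that arrives for an already finished task is simply dropped, which is the resource wastage the paper sets out to avoid.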