Table of Contents
Fetching ...

Scalable Fault-Tolerant MapReduce

Demian Hespe, Lukas Hübner, Charel Mercatoris, Peter Sanders

TL;DR

A low-overhead technique with no additional work during fault-free execution and the negligible expected relative communication overhead of $1/(p-1)$ on PEs and recovery takes approximately the time of processing $1/p$ of the data on the surviving PEs.

Abstract

Supercomputers getting ever larger and energy-efficient is at odds with the reliability of the used hardware. Thus, the time intervals between component failures are decreasing. Contrarily, the latencies for individual operations of coarse-grained big-data tools grow with the number of processors. To overcome the resulting scalability limit, we need to go beyond the current practice of interoperation checkpointing. We give first results on how to achieve this for the popular MapReduce framework where huge multisets are processed by user-defined mapping and reducing functions. We observe that the full state of a MapReduce algorithm is described by its network communication. We present a low-overhead technique with no additional work during fault-free execution and the negligible expected relative communication overhead of $1/(p-1)$ on $p$ PEs. Recovery takes approximately the time of processing $1/p$ of the data on the surviving PEs. We achieve this by backing up self-messages and locally storing all messages sent through the network on the sending and receiving PEs until the next round of global communication. A prototypical implementation already indicates low overhead $<4\,\%$ during fault-free execution.

Scalable Fault-Tolerant MapReduce

TL;DR

A low-overhead technique with no additional work during fault-free execution and the negligible expected relative communication overhead of on PEs and recovery takes approximately the time of processing of the data on the surviving PEs.

Abstract

Supercomputers getting ever larger and energy-efficient is at odds with the reliability of the used hardware. Thus, the time intervals between component failures are decreasing. Contrarily, the latencies for individual operations of coarse-grained big-data tools grow with the number of processors. To overcome the resulting scalability limit, we need to go beyond the current practice of interoperation checkpointing. We give first results on how to achieve this for the popular MapReduce framework where huge multisets are processed by user-defined mapping and reducing functions. We observe that the full state of a MapReduce algorithm is described by its network communication. We present a low-overhead technique with no additional work during fault-free execution and the negligible expected relative communication overhead of on PEs. Recovery takes approximately the time of processing of the data on the surviving PEs. We achieve this by backing up self-messages and locally storing all messages sent through the network on the sending and receiving PEs until the next round of global communication. A prototypical implementation already indicates low overhead during fault-free execution.

Paper Structure

This paper contains 5 sections, 1 equation, 2 figures.

Figures (2)

  • Figure 1: Data flow and stored messages for MapReduce. Colors indicate destinations of data elements. Arrows indicate messages sent during the communication phase of a MapReduce step. After failure of PE $p_2$, we can reconstruct the data held by $p_2$ using the messages sent by other PEs. Black arrows correspond to messages available on the sending PE. The red self-message is the only message not available on another PE and has to be additionally communicated for backup.
  • Figure 2: Overhead for different MapReduce benchmark algorithms. "Fault Tolerance" shows the overhead caused by the backup mechanism in a fault-free scenario, i.e., relative to the running time without fault-tolerance enabled. "Recovery" shows the time taken to recover from a failure relative to the time of a single MapReduce step.