Fault-tolerant Reduce and Allreduce operations based on correction

Martin Kuettler, Hermann Haertig

TL;DR

A correction-like communication phase precedes a tree-based phase, yielding a Reduce algorithm that tolerates a number of failed processes. This Reduce is then combined with Broadcast to provide Allreduce.

Abstract

Implementations of Broadcast based on some information dissemination algorithm (e.g., gossip or tree-based communication) followed by a correction algorithm have been proposed previously. This work applies a similar idea to Reduce: a correction-like communication phase precedes a tree-based phase. The result is a Reduce algorithm that tolerates a number of failed processes. Semantics of the resulting algorithm are provided and proven. Based on these results, Broadcast and Reduce are combined to provide Allreduce.
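
To make the composition concrete, the following is a minimal failure-free sketch of a tree-based Reduce followed by a Broadcast, which together give Allreduce. It is an illustration only: the binary tree over ranks (children of rank `i` at `2*i + 1` and `2*i + 2`), the function names, and the assumption of a commutative and associative operator are all hypothetical choices made here, not the tree or the fault-tolerant protocol of the paper.

```python
import operator


def tree_reduce(values, op=operator.add):
    """Failure-free reduce over a rank-indexed binary tree (assumed layout).

    Rank 0 is the root; rank r sends its partial result to (r - 1) // 2.
    `op` is assumed to be commutative and associative.
    """
    partial = list(values)
    # Visit ranks from the leaves upward: every child has a higher rank
    # than its parent, so a rank has absorbed all of its children's
    # partial results before it forwards its own.
    for rank in range(len(partial) - 1, 0, -1):
        parent = (rank - 1) // 2
        partial[parent] = op(partial[parent], partial[rank])
    return partial[0]


def broadcast(value, n):
    """Failure-free broadcast: every one of the n ranks ends up with `value`."""
    return [value] * n


def allreduce(values, op=operator.add):
    """Allreduce as Reduce to the root followed by Broadcast from the root."""
    return broadcast(tree_reduce(values, op), len(values))
```

For example, `allreduce([1, 2, 3, 4, 5])` leaves every rank with the sum of all five inputs. The sketch omits the correction phases entirely; it only shows the order in which the two collectives are composed.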

Paper Structure (13 sections, 7 theorems, 2 figures, 5 algorithms)

This paper contains 13 sections, 7 theorems, 2 figures, 5 algorithms.

Key Result

Theorem 1

Let there be no more than $f$ processes that experience a failure, in-operational or pre-operational. Let an $I(f)$-tree be used, with up-correction groups of size $f+1$. After up-correction, all values of non-failed processes, except for the values of processes grouped with root, are included exactly
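
The role of the group size $f+1$ can be illustrated with a toy model. The sketch below assumes that ranks are partitioned into consecutive groups of $f+1$, that every failure is pre-operational (a failed process never sends anything), and that live members of a group exchange values all-to-all. The partitioning scheme, the function name `up_correction`, and the treatment of the root as an ordinary group member are assumptions made for illustration; the paper's actual grouping, message schedule, and handling of in-operational failures are not reproduced here.

```python
from functools import reduce
import operator


def up_correction(values, failed, f, op=operator.add):
    """Toy up-correction with groups of size f + 1 (assumed layout).

    values: one input value per rank.
    failed: set of ranks that failed before sending anything.
    Returns a dict mapping each live rank to the combination of the
    values of all live ranks in its group.
    """
    group_size = f + 1
    if len(values) % group_size != 0:
        raise ValueError("this sketch assumes n is a multiple of f + 1")
    held = {}
    for start in range(0, len(values), group_size):
        group = range(start, start + group_size)
        alive = [r for r in group if r not in failed]
        if not alive:
            continue  # cannot happen when at most f ranks fail overall
        combined = reduce(op, (values[r] for r in alive))
        for r in alive:
            held[r] = combined
    return held


held = up_correction([1, 2, 4, 8, 16, 32], failed={1}, f=1)
```

With at most $f$ failures, no group of $f+1$ ranks can lose all of its members, so each group keeps at least one live rank holding the combined value of that group's live members. In the example, taking one survivor per group (ranks 0, 2, and 4) and adding their held values gives 61, the sum of all inputs except that of the failed rank 1.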

Figures (2)

  • Figure 1: The failed process 1 impedes the propagation of data in the tree phase. Arrow labels show the values of which processes are included in the respective message along the path.
  • Figure 2: Up-correction phase and subsequent tree phase with the same failed processes as in Figure 1.

Theorems & Definitions (15)

  • Definition
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Theorem 5
  • ...and 5 more