Fault-tolerant Reduce and Allreduce operations based on correction
Martin Kuettler, Hermann Haertig
TL;DR
A correction-like communication phase preceding a tree-based phase yields a Reduce algorithm that tolerates a number of failed processes; combining it with Broadcast provides Allreduce.
Abstract
Implementations of Broadcast based on an information dissemination algorithm (e.g., gossip or tree-based communication) followed by a correction algorithm have been proposed previously. This work applies a similar idea to Reduce: a correction-like communication phase precedes a tree-based phase. The result is a Reduce algorithm that tolerates a number of failed processes. The semantics of the resulting algorithm are provided and proven. Based on these results, Broadcast and Reduce are combined to provide Allreduce.
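As a rough illustration only, the two-phase structure can be simulated in a few lines of Python. This is not the authors' algorithm. The grouping of ranks into blocks of f + 1, the assumption that all failures happen before the operation starts, and the names `fault_tolerant_reduce` and `fault_tolerant_allreduce` are hypothetical choices made for this sketch.

```python
from functools import reduce as fold
import operator


def fault_tolerant_reduce(values, failed, f, op=operator.add):
    """Toy simulation: reduce the values of all live ranks.

    Assumptions (not from the paper): at most f ranks have failed, they
    failed before the operation started, and ranks are grouped into
    consecutive blocks of f + 1.
    """
    if len(failed) > f:
        raise ValueError("more failures than the tolerated number f")
    n = len(values)
    size = f + 1
    groups = [range(s, min(s + size, n)) for s in range(0, n, size)]

    # Phase 1 (correction-like): live ranks of a group exchange values,
    # so every live member holds the group's partial result. A full
    # group of f + 1 ranks always has at least one live member.
    partials = []
    for group in groups:
        live = [r for r in group if r not in failed]
        if live:  # only a short final group can be entirely failed
            partials.append(fold(op, (values[r] for r in live)))

    # Phase 2 (tree-based): combine group results pairwise, level by level.
    while len(partials) > 1:
        partials = [fold(op, partials[i:i + 2])
                    for i in range(0, len(partials), 2)]
    return partials[0] if partials else None


def fault_tolerant_allreduce(values, failed, f, op=operator.add):
    """Reduce, then hand the result to every live rank (Broadcast stand-in)."""
    result = fault_tolerant_reduce(values, failed, f, op)
    return {r: result for r in range(len(values)) if r not in failed}
```

For example, with values 1 through 8 on ranks 0 through 7, f = 2, and ranks 2 and 5 failed, the sketch reduces only the surviving contributions, and every live rank receives the same result in the Allreduce variant.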
