Modifying the Asynchronous Jacobi Method for Data Corruption Resilience

Christopher J. Vogl, Zachary Atkins, Alyson Fox, Agnieszka Miedlar, Colin Ponce

TL;DR

A variant of the asynchronous Jacobi (ASJ) method is developed that achieves resilience to data corruption by rejecting solution approximations from neighbor devices according to a bound derived from convergence theory.

Abstract

Moving scientific computation from high-performance computing (HPC) and cloud computing (CC) environments to devices on the edge, i.e., physically near instruments of interest, has received tremendous interest in recent years. Such edge computing environments can operate on data in-situ, offering enticing benefits over data aggregation to HPC and CC facilities that include avoiding costs of transmission, increased data privacy, and real-time data analysis. Because of the inherent unreliability of edge computing environments, new fault tolerant approaches must be developed before the benefits of edge computing can be realized. Motivated by algorithm-based fault tolerance, a variant of the asynchronous Jacobi (ASJ) method is developed that achieves resilience to data corruption by rejecting solution approximations from neighbor devices according to a bound derived from convergence theory. Numerical results on a two-dimensional Poisson problem show that the new rejection criterion, along with a novel approximation to the shortest path length on which the criterion depends, restores convergence for the ASJ variant in the presence of certain types of data corruption. Numerical results are also obtained for the case in which the singular values in the analytic bound are approximated. A linear system with a denser sparsity pattern is also explored. All results indicate that successful resilience to data corruption depends on whether the bound tightens fast enough to reject corrupted data before the iteration evolution deviates significantly from that predicted by the convergence theory defining the bound. This observation generalizes to future work on algorithm-based fault tolerance for other asynchronous algorithms, including upcoming approaches that leverage Krylov subspaces.
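The idea of rejecting a neighbor's value when it violates a convergence bound can be illustrated with a minimal sketch. This is not the paper's criterion, which depends on singular values and a shortest path length that the abstract does not spell out. The sketch assumes the simpler synchronous Jacobi setting and uses the textbook contraction bound $\|x^{k+1} - x^k\|_\infty \le \rho^k \|x^1 - x^0\|_\infty$, where $\rho$ is the infinity norm of the Jacobi iteration matrix. The 3-by-3 matrix, the function names, and the sign flip injected at sweep 10 are all hypothetical choices made for illustration.

```python
def jacobi_sweep(A, b, x):
    """One synchronous Jacobi sweep: x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def contraction_factor(A):
    """Infinity norm of the Jacobi iteration matrix (< 1 for strict diagonal dominance)."""
    n = len(A)
    return max(
        sum(abs(A[i][j]) for j in range(n) if j != i) / abs(A[i][i])
        for i in range(n)
    )


def is_plausible(received, previous, bound):
    """Accept a value only if it moved no more than the bound allows."""
    return abs(received - previous) <= bound


def run_demo(sweeps=20, flip_at=10):
    # Hypothetical strictly diagonally dominant test system (not from the paper).
    A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
    b = [1.0, 2.0, 3.0]
    rho = contraction_factor(A)
    x_prev = [0.0, 0.0, 0.0]
    x = jacobi_sweep(A, b, x_prev)
    d0 = max(abs(u - v) for u, v in zip(x, x_prev))  # ||x^1 - x^0||_inf
    clean_rejections = 0
    flip_rejected = None
    for k in range(1, sweeps):
        x_prev, x = x, jacobi_sweep(A, b, x)
        bound = d0 * rho ** k  # bound on ||x^{k+1} - x^k||_inf
        # Count fault-free values that the test would wrongly reject.
        clean_rejections += sum(
            not is_plausible(u, v, bound) for u, v in zip(x, x_prev)
        )
        if k == flip_at:
            # Assumed fault model: the last component arrives with its sign flipped.
            flip_rejected = not is_plausible(-x[2], x_prev[2], bound)
    return clean_rejections, flip_rejected, x
```

In this setting the bound shrinks geometrically, so a sign flip that arrives after a few sweeps falls far outside it, while fault-free updates stay inside it. The sketch does not model asynchronous updates or the reuse of stale values after a rejection, which is where the tightening rate of the bound becomes critical.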

Paper Structure

This paper contains 11 sections, 14 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: Directed acyclic graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ illustrating an example two-node evolution of the solution approximations $\mathbf{x}_1^{\nu_1(t)}$ and $\mathbf{x}_2^{\nu_2(t)}$.
  • Figure 1: Ensemble convergence of ASJ-R for various convergence durations on the Poisson benchmark problem with $m=144$ (left), $m=400$ (center), and $m=784$ (right). All ensemble runs converge for all durations with the smallest system size; however, results for the larger system sizes indicate that the agents are unable to reach consensus on convergence for the smallest duration.
  • Figure 2: Ensemble convergence of ASJ and ASJ-R with bit flip probability $p=0.01$, with double-precision floating-point bit flips limited to the lower mantissa $\textsc{ie}^3({[0{-}25]})$. Convergence is achieved in all ASJ and ASJ-R runs, with times to solution comparable to the respective baseline (no corruption) values.
  • Figure 3: Ensemble convergence of ASJ and ASJ-R with bit flip probability $p=0.01$, with double-precision floating-point bit flips limited to the sign bit $\textsc{ie}^3({63})$. Convergence is lost for all of the ASJ runs and achieved for all ASJ-R runs, albeit with longer times to solution.
  • Figure 4: Approximate shortest path length $\tilde{s}_i(t)$ used by the ASJ-R algorithm with bit flip probability $p=0.01$, with double-precision floating-point bit flips limited to the sign bit $\textsc{ie}^3({63})$ (left: $i=7$, right: $i=9$). The approximate shortest path length reaches $700$ at around the time the stagnation period ends in Figure 3 (denoted by the dashed black line).
  • ...and 12 more figures
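
The corruption model named in the captions, bit flips in an IEEE 754 double restricted to either the lower mantissa bits 0 through 25 or the sign bit 63, can be sketched as follows. The helper names are hypothetical. Applying the probability $p$ once per transmitted value and choosing the flipped bit uniformly from the allowed range are assumptions made for illustration, since the captions do not state how the flips are sampled.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit of a float64 (0 = lowest mantissa bit, 63 = sign bit)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def maybe_corrupt(value, p, bits, rng):
    """With probability p, flip one bit drawn uniformly from `bits` (assumed model)."""
    if rng.random() < p:
        return flip_bit(value, rng.choice(list(bits)))
    return value


rng = random.Random(0)
sign_flipped = flip_bit(1.5, 63)        # sign bit: value changes sign
mantissa_flipped = flip_bit(1.0, 25)    # lower mantissa: change of 2**-27
noisy = maybe_corrupt(0.25, 0.01, range(0, 26), rng)
```

A flip in bits 0 through 25 perturbs a value of magnitude near one by at most about $2^{-27}$, whereas a sign-bit flip changes the value by twice its magnitude. That difference in scale is consistent with the captions reporting that lower-mantissa flips leave convergence intact while sign flips break plain ASJ.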