Fault Oblivious Eigenvalue Solver

Jayanta Mukherjee, Xuejiao Kang, David F. Gleich, Ahmed Sameh, Ananth Grama

TL;DR

The paper tackles fault tolerance in large-scale eigenvalue computations by using erasure-coded augmentations to convert $Ax = \lambda x$ into a fault-oblivious generalized eigenproblem with augmented matrices $\tilde{A}$ and $\tilde{B}$. It proves eigenvalue equivalence and develops practical recovery schemes, then implements erasure-coded variants of the Power Method and TraceMin, validating them on dense and sparse benchmarks. The results show minimal overhead and robust convergence under single and multiple faults, offering a scalable alternative to checkpoint-restart for resilient eigenvalue computations. This approach provides a principled, low-overhead path to reliable eigenvalue solvers on fault-prone HPC platforms, with potential applicability to a broad class of linear-algebra kernels.

Abstract

Eigenvalue problems serve as fundamental substrates for applications in large-scale scientific simulations and machine learning, often requiring computation on massively parallel platforms. As these platforms scale to hundreds of thousands of cores, hardware failures become a significant challenge to reliability and efficiency. In this paper, we propose and analyze a novel fault-tolerant eigenvalue solver based on erasure-coded computations -- a technique that enhances resilience by augmenting the system with redundant data a priori. This transformation reformulates the original eigenvalue problem as a generalized eigenvalue problem, enabling fault-oblivious computation while preserving numerical stability and convergence properties. We formulate the augmentation scheme, establish the necessary conditions for the encoded blocks, and prove the relationship between the original and transformed problems. We implement an erasure-coded TraceMin eigensolver and demonstrate its effectiveness in extracting eigenvalues in the presence of faults. Our experimental results show that the proposed solver incurs minimal computational overhead, maintains robust convergence, and scales efficiently with the number of faults, making it a practical solution for resilient eigenvalue computations in large-scale systems.
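The reformulation described above can be illustrated with a small numerical sketch. The summary does not give the construction of $\tilde{A}$ and $\tilde{B}$, so the form used here is an assumption: $\tilde{A} = Z^T A Z$ and $\tilde{B} = Z^T Z$ with $Z = [I \;\; E]$, where `E` is a hypothetical $n \times k$ encoding block chosen so that any $k$ of its rows are linearly independent. Under that assumption, the sketch checks three things with NumPy: the pencil has a joint null space, an eigenpair of $A$ lifts to the augmented pencil, and the same eigenvector can still be represented when $k$ components of the augmented vector are erased (held at zero). It is not the paper's solver, only a consistency check of the idea.

```python
import numpy as np

n, k = 6, 2  # problem size and number of redundant (coded) components

# Small symmetric test matrix: 2 on the diagonal, -1 on the off-diagonals.
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Hypothetical encoding block E (assumption, not taken from the paper).
# Rows are [1, i], so any two distinct rows are linearly independent.
E = np.column_stack([np.ones(n), np.arange(1.0, n + 1.0)])

# Assumed augmentation: Z = [I E], A_aug = Z^T A Z, B_aug = Z^T Z.
Z = np.hstack([np.eye(n), E])          # shape (n, n + k)
A_aug = Z.T @ A @ Z                    # shape (n + k, n + k)
B_aug = Z.T @ Z

# 1) Joint null space: columns of [-E; I] satisfy Z @ N = 0,
#    so both A_aug @ N and B_aug @ N vanish.
N = np.vstack([-E, np.eye(k)])
null_resid = max(np.abs(A_aug @ N).max(), np.abs(B_aug @ N).max())

# 2) An eigenpair (lam_max, x) of A lifts to y = [x; 0] with
#    A_aug y = lam_max * B_aug y.
lam, X = np.linalg.eigh(A)
lam_max, x = lam[-1], X[:, -1]
y = np.concatenate([x, np.zeros(k)])
lift_resid = np.linalg.norm(A_aug @ y - lam_max * (B_aug @ y))

# 3) Erasure: force k components of the augmented vector to zero and
#    solve for the rest so that Z @ y_erased still equals x.
erased = [1, 4]
keep = [i for i in range(n + k) if i not in erased]
y_erased = np.zeros(n + k)
y_erased[keep] = np.linalg.solve(Z[:, keep], x)
erased_resid = np.linalg.norm(A_aug @ y_erased - lam_max * (B_aug @ y_erased))
recovered = Z @ y_erased               # eigenvector of A recovered from y_erased

print(null_resid, lift_resid, erased_resid)
```

All three residuals should be at the level of floating-point round-off. The third check is the one that matters for fault obliviousness in this sketch: the erased components carry no information that the remaining ones cannot supply, as long as the rows of `E` at the erased positions form an invertible block.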

Paper Structure

This paper contains 14 sections, 3 theorems, 32 equations, 11 figures, 1 table, 4 algorithms.

Key Result

Lemma 3.1

The matrices $\tilde{A}$ and $\tilde{B}$ from the augmented system (eq:augmented2) have a joint null space ...

Figures (11)

  • Figure 1: Erasure-Coded TraceMin Solver Flow.
  • Figure 1: MNIST Train Dataset with 0.1% Erased
  • Figure 2: Coding Matrix
  • Figure 2: Timing Breakdown
  • Figure 3: Sparse bcsstk17 Train Dataset with 0.1% Erased
  • ...and 6 more figures

Theorems & Definitions (6)

  • Lemma 3.1: Null Space in Augmented Generalized Eigenvalue System
  • Proof 1
  • Theorem 3.2: Matrix Pencil Equivalence
  • Proof 2
  • Theorem 3.3: Eigenvalue Equivalence
  • Proof 3