Fault Tolerant Reconfigurable ML Multiprocessor

Tangrui Li, Justin Y. Shi, Matteo Spatola, Hongzheng Wang

TL;DR

The paper tackles the escalating demands of neural-network training by proposing a fault-tolerant, reconfigurable von Neumann–style multiprocessor built around ACAN, a tuple-space based framework that decouples programs and data from hardware. By enabling runtime formation of SIMD and MIMD workflows and supporting dynamic device participation, ACAN aims to mitigate single-point failures and checkpoint overhead in heterogeneous clusters, with potential MLIR integration to map diverse accelerators. The authors validate the concept through three simulated experiments focusing on feasibility, adaptability, and robustness, demonstrating stable training and a predictable inverse relationship between timeout and compute power even under faults. If scaled to real hardware, this approach could deliver scalable, resilient NN training across distributed accelerators, while enabling flexible compiler-backed lowering via MLIR and reducing reliance on checkpoints.
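The tuple-space idea summarized above can be illustrated with a toy, in-memory sketch. Everything in it is an assumption made for illustration: `TupleSpace`, `worker`, `run`, the `fail_first` parameter, and the re-issue-on-timeout policy are hypothetical names and choices, not ACAN's actual interface or the authors' implementation. The sketch only shows how routing work through a shared space lets a coordinator recover from a worker that silently drops a task, without a checkpoint.

```python
import threading
import time
from collections import deque


class TupleSpace:
    """Toy in-memory tuple space; tuples are matched by their first field (a tag)."""

    def __init__(self):
        self._tuples = deque()
        self._cond = threading.Condition()

    def put(self, tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def take(self, tag, timeout=None):
        """Remove and return the oldest tuple with this tag, or None on timeout."""
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cond:
            while True:
                for tup in self._tuples:
                    if tup[0] == tag:
                        self._tuples.remove(tup)
                        return tup
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return None
                self._cond.wait(remaining)


def worker(space, fail_first=0):
    """Serve tasks until a stop tuple arrives; drop the first `fail_first` tasks
    to mimic a device that fails after taking work (hypothetical fault model)."""
    dropped = 0
    while True:
        _, task_id, x = space.take("task")
        if task_id is None:  # stop tuple
            return
        if dropped < fail_first:
            dropped += 1
            continue  # task is lost; no result is ever written
        space.put(("result", task_id, x * x))


def run(inputs, n_workers=2, fail_first=1, timeout=0.2):
    space = TupleSpace()
    threads = [
        threading.Thread(target=worker, args=(space, fail_first if i == 0 else 0))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()

    pending = dict(enumerate(inputs))
    for task_id, x in pending.items():
        space.put(("task", task_id, x))

    results = {}
    while pending:
        got = space.take("result", timeout=timeout)
        if got is None:
            # No result within the timeout: re-issue every outstanding task.
            for task_id, x in pending.items():
                space.put(("task", task_id, x))
            continue
        _, task_id, value = got
        if task_id in pending:  # ignore duplicates from re-issued tasks
            results[task_id] = value
            del pending[task_id]

    for _ in threads:
        space.put(("task", None, None))  # one stop tuple per worker
    for t in threads:
        t.join()
    return [results[i] for i in range(len(inputs))]


if __name__ == "__main__":
    print(run([1, 2, 3, 4]))
```

Because workers never address each other directly, a worker can join or disappear without the coordinator tracking it; the only fault signal in this sketch is a result that fails to arrive before the timeout.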

Abstract

This paper reports three computational experiments on a von Neumann-inspired, reconfigurable, fault-tolerant multiprocessor for neural network (NN) training workflows. The experiments are intended to demonstrate the feasibility, adaptability, and robustness of the proposed reconfigurable multiprocessor architecture for non-regular workflows. A potential integration with MLIR compilers is also discussed as a way to bring diverse accelerator hardware into existing practical applications.

Paper Structure

This paper contains 17 sections and 4 figures.

Figures (4)

  • Figure 1: The MSE loss curve in the feasibility test.
  • Figure 2: The relation between the timeout and the computational power in the adaptability test.
  • Figure 3: The MSE loss curve in the robustness test.
  • Figure 4: The relation between the timeout and the computational power in the robustness test.