Fault Tolerant Reconfigurable ML Multiprocessor
Tangrui Li, Justin Y. Shi, Matteo Spatola, Hongzheng Wang
TL;DR
The paper tackles the escalating demands of neural-network training by proposing a fault-tolerant, reconfigurable von Neumann–style multiprocessor built around ACAN, a tuple-space based framework that decouples programs and data from hardware. By enabling runtime formation of SIMD and MIMD workflows and supporting dynamic device participation, ACAN aims to mitigate single-point failures and checkpoint overhead in heterogeneous clusters, with potential MLIR integration to map diverse accelerators. The authors validate the concept through three simulated experiments focusing on feasibility, adaptability, and robustness, demonstrating stable training and a predictable inverse relationship between timeout and compute power even under faults. If scaled to real hardware, this approach could deliver scalable, resilient NN training across distributed accelerators, while enabling flexible compiler-backed lowering via MLIR and reducing reliance on checkpoints.
Abstract
This paper reports three computational experiments on a von Neumann-inspired, reconfigurable, fault-tolerant multiprocessor for neural network (NN) training workflows. The experiments are intended to demonstrate the feasibility, robustness, and adaptability of the proposed architecture for non-regular workflows. A potential integration with MLIR compilers is also discussed as a way to bring diverse accelerator hardware into existing practical applications.
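To make the tuple-space idea from the TL;DR concrete, here is a minimal single-process sketch in Python. It is not ACAN's API: the names `TupleSpace`, `worker_step`, and `run` are hypothetical, the "timeout" is modeled as one round of the master loop, and a device fault is modeled as a worker that takes a task and never publishes a result. It only illustrates how decoupling tasks from workers lets lost work be re-issued without a checkpoint.

```python
class TupleSpace:
    """Toy in-memory tuple space; tuples are matched on their first field (a tag)."""

    def __init__(self):
        self._tuples = []

    def put(self, tup):
        self._tuples.append(tup)

    def take(self, tag):
        """Remove and return the oldest tuple with the given tag, or None."""
        for i, tup in enumerate(self._tuples):
            if tup[0] == tag:
                return self._tuples.pop(i)
        return None


def worker_step(space, fn, fails=False):
    """One anonymous worker step: take a task, compute, publish the result.

    A failing worker removes the task and then disappears without publishing,
    which is how a device fault is modeled in this sketch.
    """
    task = space.take("task")
    if task is None:
        return
    _, task_id, payload = task
    if not fails:
        space.put(("result", task_id, fn(payload)))


def run(payloads, fail_pattern, fn=lambda x: x * x, max_rounds=100):
    """Master loop: publish tasks, collect results, re-issue missing ones.

    fail_pattern is cycled over worker steps; True means that step crashes.
    Each loop round stands in for one timeout interval (an assumption).
    """
    space = TupleSpace()
    pending = dict(enumerate(payloads))
    for task_id, payload in pending.items():
        space.put(("task", task_id, payload))

    results = {}
    step = 0
    for _ in range(max_rounds):
        if not pending:
            break
        # Any available worker may pick up any task; the master never
        # addresses a specific device.
        for _ in range(len(pending)):
            worker_step(space, fn, fails=fail_pattern[step % len(fail_pattern)])
            step += 1
        # Collect whatever results arrived within this "timeout" round.
        res = space.take("result")
        while res is not None:
            _, task_id, value = res
            if task_id in pending:
                results[task_id] = value
                del pending[task_id]
            res = space.take("result")
        # Tasks with no result are assumed lost and are simply re-issued.
        for task_id, payload in pending.items():
            space.put(("task", task_id, payload))

    # Raises KeyError if some task never completed within max_rounds.
    return [results[i] for i in range(len(payloads))]
```

For example, `run([1, 2, 3, 4], [True, False, False])` loses two tasks in the first round and recovers them in the second. The master holds only the list of outstanding task identifiers, so no worker state needs to be saved or restored.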
