FORTALESA: Fault-Tolerant Reconfigurable Systolic Array for DNN Inference
Natalia Cherezova, Artur Jutman, Maksim Jenihhin
TL;DR
FORTALESA addresses the reliability of DNN inference on systolic-array (SA) accelerators with a run-time reconfigurable SA that offers three execution modes, selected per layer according to layer vulnerability. The approach couples this heterogeneous layer-to-mode mapping with a novel reliability assessment method based on fault propagation analysis, and protects the registers and MAC units of the processing elements without interrupting inference. Key contributions include the DRG and TRG fault-tolerant modes, four implementation options, and a fast method for exploring the reliability-performance design space. Reconfigurability yields a speedup of up to 3×, and the design requires 6× fewer resources than static redundancy. The work evaluates these trade-offs through hardware parametrization, reliability analyses on standard CNNs, and comparisons with the state of the art; planned future work includes ASIC deployment and an open-source release.
Abstract
The emergence of Deep Neural Networks (DNNs) in mission- and safety-critical applications brings their reliability to the forefront. The high performance demands of DNNs require specialized hardware accelerators, and the systolic array architecture is widely used in such accelerators because of its parallelism and regular structure. This work presents a run-time reconfigurable systolic array architecture with three execution modes and four implementation options. All four implementations are evaluated in terms of resource utilization, throughput, and fault tolerance improvement. The proposed architecture enhances the reliability of DNN inference on a systolic array by heterogeneously mapping different network layers to different execution modes. The approach is supported by a novel reliability assessment method based on fault propagation analysis, which is used to explore suitable execution mode--layer mappings for DNN inference. The proposed architecture efficiently protects the registers and multiply-accumulate (MAC) units of the systolic array processing elements (PEs) from transient and permanent faults. Reconfigurability enables a speedup of up to $3\times$, depending on layer vulnerability. Furthermore, the architecture requires $6\times$ fewer resources than static redundancy and $2.5\times$ fewer resources than the previously proposed solution for transient faults.
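
To make the idea of heterogeneous layer-to-mode mapping concrete, the following is a minimal sketch rather than the authors' method. It assumes, purely for illustration, that each layer has a scalar vulnerability score in [0, 1], that the non-redundant, DRG, and TRG modes cost 1, 2, and 3 time units per unit of work, and that fixed thresholds select the mode. The names `MODE_COST`, `choose_mode`, `map_layers`, and `speedup_vs_full_trg`, the thresholds, and the example layers are all hypothetical.

```python
# Illustrative sketch only: mode costs, thresholds, and vulnerability scores
# are hypothetical placeholders, not values taken from the paper.

# Assumed relative latency per unit of work in each execution mode
# (1.0 = non-redundant baseline).
MODE_COST = {"normal": 1.0, "DRG": 2.0, "TRG": 3.0}


def choose_mode(vulnerability, drg_threshold=0.3, trg_threshold=0.7):
    """Pick an execution mode from a layer's vulnerability score in [0, 1]."""
    if vulnerability >= trg_threshold:
        return "TRG"
    if vulnerability >= drg_threshold:
        return "DRG"
    return "normal"


def map_layers(layers):
    """Return {layer_name: mode} for a list of (name, vulnerability, work) tuples."""
    return {name: choose_mode(vuln) for name, vuln, _ in layers}


def speedup_vs_full_trg(layers):
    """Speedup of the heterogeneous mapping over running every layer in TRG mode."""
    mapping = map_layers(layers)
    hetero = sum(work * MODE_COST[mapping[name]] for name, _, work in layers)
    full = sum(work * MODE_COST["TRG"] for _, _, work in layers)
    return full / hetero


# Hypothetical three-layer network: (name, vulnerability score, relative work).
layers = [("conv1", 0.9, 1.0), ("conv2", 0.5, 4.0), ("fc", 0.1, 1.0)]
print(map_layers(layers))
print(speedup_vs_full_trg(layers))
```

Under this assumed cost model, the speedup over an all-TRG schedule approaches 3 only when every layer can run in the non-redundant mode, so the achievable gain depends on how many layers are vulnerable enough to need protection.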
