RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine
Philip Wiese, Maurus Item, Luca Bertaccini, Yvan Tortorella, Angelo Garofalo, Luca Benini
TL;DR
RedMulE-FT is a runtime-configurable fault-tolerant extension of the RedMulE floating-point matrix-multiplication accelerator that combines data-path redundancy with parity-based protection of weights and a protected control path. Protecting the data path alone reduces uncorrected faults by $11\times$ at only $2.3\%$ area overhead. Extending protection to the control path yields no functional errors in $1\text{M}$ fault injections, at a total area overhead of $25.2\%$ while maintaining $500$ MHz in a $12$ nm technology. A runtime mode switch selects between fault-tolerant and high-throughput operation, and detected faults are handled by retry-based recovery. The work demonstrates feasibility through a physical implementation inside a PULP cluster in $12$ nm technology and provides a foundation for configurable reliability in data-parallel accelerators, balancing robustness against performance and area constraints.
Abstract
As safety-critical applications increasingly rely on data-parallel floating-point computations, there is a growing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. Replication-based methods ensure reliability but incur high area and power costs, while error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator that balances fault tolerance, area overhead, and performance impact. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11× reduction in uncorrected faults with only 2.3% area overhead. Full protection extends to the control signals, resulting in no functional errors after 1M injections in our extensive fault injection simulation campaign, with a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.
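The interplay of replication, error-detecting codes, and retry described above can be illustrated with a small software model. The following Python sketch is not the RedMulE-FT hardware design; it only mimics the idea under simplifying assumptions. All names (`parity_bit`, `weights_intact`, `matmul`, `fault_tolerant_matmul`), the additive fault model, and the retry limit are hypothetical and chosen for illustration.

```python
import struct
from typing import Callable, List, Optional, Tuple

Matrix = List[List[float]]
Fault = Tuple[int, int, float]  # (row, column, additive error): hypothetical fault model


def parity_bit(value: float) -> int:
    """Even-parity bit over the 16-bit FP16 encoding of `value`."""
    bits = int.from_bytes(struct.pack("<e", value), "little")
    return bin(bits).count("1") & 1


def weights_intact(w: Matrix, parities: List[List[int]]) -> bool:
    """Check every weight against the parity bit stored when it was loaded."""
    return all(parity_bit(v) == p
               for row, prow in zip(w, parities)
               for v, p in zip(row, prow))


def matmul(x: Matrix, w: Matrix, fault: Optional[Fault] = None) -> Matrix:
    """Plain matrix multiply; `fault` corrupts one output element (assumed model)."""
    z = [[sum(x[i][k] * w[k][j] for k in range(len(w)))
          for j in range(len(w[0]))] for i in range(len(x))]
    if fault is not None:
        i, j, delta = fault
        z[i][j] += delta
    return z


def fault_tolerant_matmul(
    x: Matrix,
    w: Matrix,
    inject: Optional[Callable[[int, int], Optional[Fault]]] = None,
    fault_tolerant: bool = True,
    max_retries: int = 3,
) -> Tuple[Matrix, int]:
    """Return (result, retries). `inject(attempt, copy)` emulates transient faults."""
    pick = inject if inject is not None else (lambda attempt, copy: None)
    if not fault_tolerant:
        # High-throughput mode: a single unchecked execution.
        return matmul(x, w, pick(0, 0)), 0
    for attempt in range(max_retries + 1):
        primary = matmul(x, w, pick(attempt, 0))
        shadow = matmul(x, w, pick(attempt, 1))
        if primary == shadow:  # redundant copies agree: accept the result
            return primary, attempt
        # Mismatch detected: discard both results and retry the task.
    raise RuntimeError("fault persisted after all retries")
```

In this model, a transient fault that hits only one of the two redundant executions is detected by the comparison and removed by re-execution, whereas the same fault in high-throughput mode silently corrupts the result. A single bit flip in a stored weight changes its parity, so `weights_intact` reports it; multi-bit flips of even weight would go undetected by a single parity bit.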
