RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine

Philip Wiese, Maurus Item, Luca Bertaccini, Yvan Tortorella, Angelo Garofalo, Luca Benini

TL;DR

RedMulE-FT is a runtime-configurable fault-tolerant extension of the RedMulE FP matrix-multiplication accelerator that combines data-path redundancy with parity-based protection of weights and a protected control path. The approach achieves an $11\times$ reduction in uncorrected faults in the data path with only $2.3\%$ area overhead. Extending protection to the control path yields no functional errors after $1\text{M}$ fault injections, at a total area overhead of $25.2\%$ while maintaining $500$ MHz in a $12$ nm process. A runtime mode switch selects between fault-tolerant and high-throughput operation, and retry-based recovery handles detected faults. The work demonstrates feasibility through a physical implementation inside a PULP cluster in $12$ nm technology and provides a foundation for configurable reliability in data-parallel accelerators, balancing robustness against performance and area constraints.

Abstract

As safety-critical applications increasingly rely on data-parallel floating-point computations, there is a growing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. Replication-based methods ensure reliability but incur high area and power costs, while error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator that balances fault tolerance, area overhead, and performance impact. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11x uncorrected fault reduction with only 2.3% area overhead. Full protection extends to control signals, resulting in no functional errors after 1M injections during our extensive fault injection simulation campaign, with a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.
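
The interplay of a per-task mode, redundant computation, and retry-based recovery can be illustrated in software. The following is a minimal sketch, not the hardware implementation: the names `run_task`, `make_glitchy`, and `max_retries` are hypothetical, and the "fault" is a simple additive corruption standing in for a bit flip.

```python
def matmul(x, w):
    """Plain reference matrix multiplication on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def make_glitchy(n_faulty_calls):
    """Hypothetical fault model: the first n_faulty_calls results are corrupted."""
    state = {"calls": 0}

    def compute(x, w):
        z = matmul(x, w)
        state["calls"] += 1
        if state["calls"] <= n_faulty_calls:
            z[0][0] += 1.0  # stand-in for a transient fault in one output
        return z

    return compute


def run_task(x, w, fault_tolerant, compute=matmul, max_retries=3):
    """Run one task in the mode chosen before execution.

    Returns (result, number_of_retries). In fault-tolerant mode the
    computation is performed twice and the results are compared; a
    mismatch triggers a retry of the whole task (assumed policy).
    """
    if not fault_tolerant:
        # High-throughput mode: single computation, faults go undetected.
        return compute(x, w), 0
    for attempt in range(max_retries + 1):
        first = compute(x, w)
        second = compute(x, w)
        if first == second:
            return first, attempt
    raise RuntimeError("mismatch persisted after all retries")


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[5.0, 6.0], [7.0, 8.0]]
result, retries = run_task(x, w, fault_tolerant=True, compute=make_glitchy(1))
silent, _ = run_task(x, w, fault_tolerant=False, compute=make_glitchy(1))
```

In this sketch the fault-tolerant run detects the single transient fault and recovers with one retry, while the high-throughput run returns the corrupted result without noticing. Two identical corruptions would still pass the equality check, so duplication of this kind detects faults only when the copies disagree.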

Paper Structure

This paper contains 13 sections, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Architecture of RedMulE-FT with fault-tolerant data and control paths. (1) Duplicated read requests; filtered duplicate writes. (2) Redundant computation on consecutive rows. (3) Parity-protected broadcasted weights. (4) Final results checked for equality. (A) Duplicated modules with reduced data width for control protection. (B) Duplicated FSMs with parity-protected register file.
  • Figure 2: PULP cluster with fully protected RedMulE-FT, implemented in GlobalFoundries' 12LP+ FinFET.
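
Item (3) of Figure 1 mentions parity-protected broadcast weights. As a rough illustration of how a parity bit detects a single bit flip, here is a small sketch that assumes 16-bit weight words and one even-parity bit per word; the word width, the parity granularity, and the function names are assumptions, not details taken from the paper.

```python
WORD_MASK = 0xFFFF  # assumed 16-bit weight word


def parity_bit(word):
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word & WORD_MASK).count("1") & 1


def protect(word):
    """Attach a parity bit to a weight word before it is broadcast."""
    return word & WORD_MASK, parity_bit(word)


def check(word, parity):
    """Return True if the received word is consistent with its parity bit."""
    return parity_bit(word) == parity


weight, p = protect(0x3C00)          # bit pattern of 1.0 in IEEE 754 half precision
single_flip = weight ^ (1 << 3)      # one bit flipped in transit
double_flip = weight ^ 0b11          # two bits flipped in transit

ok_clean = check(weight, p)          # consistent
ok_single = check(single_flip, p)    # inconsistent: fault detected
ok_double = check(double_flip, p)    # consistent again: fault missed
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them, and it misses an even number of flips, which is why a detected mismatch has to be handled by a separate recovery step such as a retry.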