Evaluating the Effectiveness of Microarchitectural Hardware Fault Detection for Application-Specific Requirements

Konstantinos-Nikolaos Papadopoulos; Christina Giannoula; Nikolaos-Charalampos Papadopoulos; Nektarios Koziris; José M. G. Merayo; Dionisios N. Pnevmatikatos

TL;DR

This paper tackles the challenge of evaluating hardware fault-detection methods beyond traditional performance and reliability metrics, focusing on safety-critical and domain-diverse applications. It compares three representative approaches—Dual Modular Redundancy (DMR), Redundant Multithreading (R-SMT), and Parallel Error Detection with Heterogeneous Cores (ParDet)—using a unified five-metric framework implemented in a gem5-based environment with fault injection over MiBench workloads. The study finds that microarchitectural methods can achieve detection capabilities comparable to DMR, but exhibit distinct trade-offs in detection latency, IPC, area, and power, with R-SMT best for area/power-critical scenarios, ParDet best for performance-critical tasks, and DMR or lightweight R-SMT for latency-critical cases. These insights, together with the proposed evaluation methodology, offer practical guidance for selecting and tailoring fault-detection strategies to specific application requirements, advancing robust, application-specific computing systems.
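The selection guidance in the summary above can be read as a small lookup from an application's dominant requirement to the suggested method(s). The following is a minimal sketch of that reading only: the requirement labels, the `GUIDANCE` table name, and the `suggest_methods` function are hypothetical names introduced here, not part of the paper, and real designs usually have to balance several requirements at once.

```python
# Guidance table transcribed from the summary above. The requirement keys and
# the function name are hypothetical labels chosen for this sketch.
GUIDANCE = {
    "area": ("R-SMT",),
    "power": ("R-SMT",),
    "performance": ("ParDet",),
    "detection latency": ("DMR", "lightweight R-SMT"),
}


def suggest_methods(critical_requirement):
    """Return the methods the summary recommends for one critical requirement."""
    key = critical_requirement.strip().lower()
    if key not in GUIDANCE:
        raise ValueError(f"no guidance for requirement: {critical_requirement!r}")
    return list(GUIDANCE[key])


if __name__ == "__main__":
    # Print the full mapping, one requirement per line.
    for requirement in GUIDANCE:
        print(requirement, "->", ", ".join(suggest_methods(requirement)))
```

The table encodes only the qualitative recommendations; it carries none of the measured values from the evaluation.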

Abstract

Reliability is necessary in safety-critical applications spanning numerous domains. Conventional hardware-based fault tolerance techniques, such as component redundancy, ensure reliability, typically at the expense of significantly increased power consumption and almost double (or more) the hardware area. To mitigate these costs, microarchitectural fault tolerance methods try to lower overheads by leveraging microarchitectural insights, but prior evaluations focus primarily on application performance. Because different safety-critical applications prioritize different requirements beyond reliability, evaluating only a limited set of metrics cannot guarantee that microarchitectural methods are practical and usable across all application scenarios. To this end, we extensively characterize and compare three fault detection methods, each representing a different major fault detection category, considering real requirements from diverse application settings and employing several metrics: design area, power, performance overhead, and detection latency. Specifically, we study three methods for hardware error detection within a processor: (i) Dual Modular Redundancy (DMR) as a conventional method, and (ii) Redundant Multithreading (R-SMT) and (iii) Parallel Error Detection (ParDet) as microarchitecture-level methods. Through this analysis, we provide insights that may guide designers in applying the most effective fault tolerance method for their specific needs, advancing the overall understanding and development of robust computing systems. We demonstrate that microarchitectural fault tolerance, i.e., R-SMT and ParDet, is comparably robust to the conventional approach (DMR). However, these methods still exhibit unappealing trade-offs for specific real-world use cases, which precludes their use in certain application scenarios.

Paper Structure (25 sections, 9 figures, 4 tables)

This paper contains 25 sections, 9 figures, 4 tables.

Introduction
Hardware Error Detection Methods
Hardware Redundancy
Redundant Multithreading
Heterogeneous Systems
Key Requirements of Safety-Critical Applications
Description of Evaluated Methods
Spatial Dual Modular Redundancy (DMR)
Redundant Simultaneous Multithreading (R-SMT)
Parallel Error Detection with Heterogeneous Cores (ParDet)
Methodology
Evaluation
Analysis of Detection Latency
Re-execution slack in R-SMT
Error detection latency
...and 10 more sections

Figures (9)

Figure 1: Different fault detection mechanisms can exhibit similar performance and reliability but differ significantly in other unevaluated metrics (detection latency and area). Thus, evaluations must consider all relevant metrics. Here, Method A (DMR) and Method B (R-SMT) are assessed against design constraints from a nano-satellite application. Green bars meet the constraints, while red bars violate them.
Figure 2: High-level overview of state-of-the-art hardware detection methods. Green segments represent the main execution, while yellow segments represent the redundant execution for error detection.
Figure 3: Distribution of re-execution slack when varying the comparison buffer size in R-SMT.
Figure 4: Distribution of detection latency across multiple injection experiments for all evaluated methods.
Figure 5: Detection efficiency in transient errors.
...and 4 more figures
