MEEK: Re-thinking Heterogeneous Parallel Error Detection Architecture for Real-World OoO Superscalar Processors
Zhe Jiang, Minli Liao, Sam Ainsworth, Dean You, Timothy Jones
TL;DR
This paper addresses fault tolerance in real-world out-of-order (OoO) superscalar processors by proposing MEEK, a CPU/OS codesigned heterogeneous error-detection architecture implemented end-to-end in RTL on an open-source SoC. The design splits the work into three parts: data extraction from the big core, a dedicated forwarding fabric, and verification by checker threads running on little cores, coordinated through lightweight ISA support and the OS scheduler. Key contributions include the first full RTL demonstration of heterogeneous parallel error detection, a non-intrusive data extraction path (DEU), a high-throughput Forwarding Fabric (F^2), and a checker-thread programming model that preserves normal execution while enabling microsecond-scale error detection with modest hardware overhead (~25–26% on a four-core setup). The evaluation reports favorable performance/overhead trade-offs, microsecond-level detection latency, and scalability as more little cores are added, offering practical guidance for deploying heterogeneous fault tolerance in production designs.
Abstract
Heterogeneous parallel error detection is an approach to building fault-tolerant processors that uses multiple power-efficient cores to re-execute software originally run on a high-performance core. Yet its complex components, which gather data across the chip from many parts of the core, raise the question of how to build it into commodity cores without invasive design changes and extensive re-engineering. We build the first full-RTL design, MEEK, into an open-source SoC, spanning the microarchitecture and ISA through to the OS and programming model. We identify and solve bottlenecks and bugs overlooked in previous work, and demonstrate that MEEK offers microsecond-level detection capacity with affordable overheads. By trading off architectural functionality across codesigned hardware and software layers, MEEK requires only light changes to a mature out-of-order superscalar core, simple coordinating software layers, and a few lines of operating-system code. MEEK's source code is available at https://github.com/SEU-ACAL/reproduce-MEEK-DAC-25.
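To make the re-execution idea concrete, the following is a minimal software sketch of segment-based checking, written in Python purely for illustration. It is not MEEK's RTL, ISA, or OS code: the three-instruction toy machine, the `Segment` record, and the functions `run_main` and `check_segment` are all hypothetical names introduced here. It assumes the main core logs a register checkpoint at each segment boundary together with the address and value of every load, so that each segment can be replayed independently by a checker.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Instr = Tuple[str, int, int, int]  # (op, dest reg, operand a, operand b / immediate)

@dataclass
class Segment:
    instrs: List[Instr]               # instructions executed in this segment
    begin: List[int]                  # register checkpoint at segment start
    load_log: List[Tuple[int, int]]   # (address, value) of every load, in order
    end: List[int]                    # register checkpoint at segment end

def execute(regs: List[int], instr: Instr, load: Callable[[int], int]) -> None:
    """Execute one instruction of the toy ISA, updating regs in place."""
    op, rd, a, b = instr
    if op == "addi":
        regs[rd] = regs[a] + b
    elif op == "add":
        regs[rd] = regs[a] + regs[b]
    elif op == "load":
        regs[rd] = load(regs[a])
    else:
        raise ValueError(f"unknown op {op}")

def run_main(program: List[Instr], memory: List[int], segment_len: int,
             fault_at: Optional[int] = None) -> List[Segment]:
    """Model of the main core: run the program and emit one Segment per chunk.
    If fault_at is set, flip the low bit of that instruction's result."""
    regs = [0, 0, 0, 0]
    segments = []
    for start in range(0, len(program), segment_len):
        chunk = program[start:start + segment_len]
        begin, load_log = list(regs), []

        def load(addr: int) -> int:
            value = memory[addr]
            load_log.append((addr, value))  # forwarded to the checker
            return value

        for i, ins in enumerate(chunk, start=start):
            execute(regs, ins, load)
            if i == fault_at:
                regs[ins[1]] ^= 1  # injected single-bit error (illustrative)
        segments.append(Segment(chunk, begin, load_log, list(regs)))
    return segments

def check_segment(seg: Segment) -> bool:
    """Model of a checker: replay the segment from its start checkpoint using
    logged load values, then compare load addresses and final registers."""
    regs, log, ok = list(seg.begin), iter(seg.load_log), True

    def replay(addr: int) -> int:
        nonlocal ok
        logged_addr, value = next(log)
        ok = ok and addr == logged_addr
        return value

    for ins in seg.instrs:
        execute(regs, ins, replay)
    return ok and regs == seg.end

program = [
    ("addi", 0, 0, 2),  # r0 = r0 + 2
    ("load", 1, 0, 0),  # r1 = memory[r0]
    ("add", 2, 1, 0),   # r2 = r1 + r0
    ("addi", 3, 2, 5),  # r3 = r2 + 5
]
memory = [10, 20, 30, 40]

clean = [check_segment(s) for s in run_main(program, memory, 2)]
faulty = [check_segment(s) for s in run_main(program, memory, 2, fault_at=2)]
print(clean, faulty)
```

Because each segment carries its own start checkpoint and load log, the calls to `check_segment` do not depend on one another. That independence is what allows several slower cores to verify different segments at the same time, at least in this simplified model.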
