MEEK: Re-thinking Heterogeneous Parallel Error Detection Architecture for Real-World OoO Superscalar Processors

Zhe Jiang, Minli Liao, Sam Ainsworth, Dean You, Timothy Jones

TL;DR

This paper tackles fault tolerance in real-world OoO superscalar processors by proposing MEEK, a CPU/OS codesigned heterogeneous error-detection architecture implemented end-to-end in RTL on an open-source SoC. The design has three parts: data extraction from the big core, a dedicated forwarding fabric, and verification by checker threads running on little cores, all coordinated through a lightweight ISA extension and the OS scheduler. Key contributions include the first full-RTL demonstration of heterogeneous parallel error detection, a non-intrusive Data Extraction Unit (DEU), a high-throughput Forwarding Fabric (F^2), and a checker-thread programming model that preserves normal execution while enabling microsecond-scale error detection with modest hardware overhead (~25–26% on a four-core setup). The evaluation reports microsecond-level detection latency, affordable performance overheads, and scaling with additional little cores, offering practical guidance for deploying heterogeneous fault tolerance in production designs.

Abstract

Heterogeneous parallel error detection is an approach to achieving fault-tolerant processors, leveraging multiple power-efficient cores to re-execute software originally run on a high-performance core. Yet, its complex components, gathering data cross-chip from many parts of the core, raise questions of how to build it into commodity cores without heavy design invasion and extensive re-engineering. We build the first full-RTL design, MEEK, into an open-source SoC, from microarchitecture and ISA to the OS and programming model. We identify and solve bottlenecks and bugs overlooked in previous work, and demonstrate that MEEK offers microsecond-level detection capacity with affordable overheads. By trading off architectural functionalities across codesigned hardware-software layers, MEEK features only light changes to a mature out-of-order superscalar core, simple coordinating software layers, and a few lines of operating-system code. MEEK's source code is available at https://github.com/SEU-ACAL/reproduce-MEEK-DAC-25.

Paper Structure

This paper contains 19 sections, 10 figures, 3 tables, 2 algorithms.

Figures (10)

  • Figure 1: Reconstructed heterogeneous parallel error detection architecture (RCP: Register Checkpoint; S/ERCP: Start/End RCP; LSL: Load-Store Log): an application thread on big core 0 is divided into three segments (Segs.) using RCPs, which are replayed and verified on little cores 1 and 2.
  • Figure 2: An overview of MEEK (DEU: Data Extraction Unit; LSQ: Load-Store Queue; PRFs: Physical Register Files; CSRs: Control and Status Registers; HM-NoC: Half-duplex Multicast Network-on-Chip; MSU: Mode Switch Unit; LSL: Load-Store Log; M-Bus: Memory Bus). In hardware: (a) a non-intrusive DEU is deployed in the big core, collecting data from various locations without disrupting the core's execution; (b) a bespoke forwarding fabric, F$^2$, prioritizes and distributes the data only to the relevant little cores; (c) dual-mode little cores perform either correctness checking or workload execution. In software: (d) a customized ISA abstracts the control interface for (e) the scheduler in the OS, enabling flexible management of verification and workload execution on the little cores.
  • Figure 3: Big core microarchitecture, extracting status data (red: ROB to DEU; blue: DEU to PRFs): (a) at commit time, the opcode and function code are routed; (b) the DEU determines whether to extract status data; (c) if so, a signal is routed and preempts the PRF controller for reading.
  • Figure 4: Little core microarchitecture, upgraded with the MSU and LSL: (a) the MSU serves as the control engine, and (b) the LSL buffers packets from F$^2$. While the checker thread runs, the MSU manages its status, and memory data is fetched from the LSL.
  • Figure 5: A deadlock occurs when the little core tries to acquire a lock held by the big core. Building synchronization between the cores and I/Os, so that the checker never needs to claim locks, resolves the deadlock (M.Status: Memory Status; PFH: Page Fault Handler).
  • ...and 5 more figures
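
The segment-based checking described in the Figure 1 caption can be illustrated with a small software model. The sketch below is not MEEK's RTL or its ISA: the three-operation toy instruction set, the names `Segment`, `run_big_core`, and `check_segment`, and the fixed segment length are all assumptions made for illustration. It shows only the general idea that a checker can replay a segment from its start RCP, take load values from the load-store log rather than from memory, and compare the result against the end RCP.

```python
from dataclasses import dataclass

# Hypothetical toy ISA (not RISC-V, not MEEK's ISA). Each instruction is a 4-tuple:
#   ("addi",  rd, rs, imm)  -> regs[rd] = regs[rs] + imm
#   ("load",  rd, addr, 0)  -> regs[rd] = memory[addr]
#   ("store", rs, addr, 0)  -> memory[addr] = regs[rs]

@dataclass
class Segment:
    start_rcp: tuple  # register checkpoint taken at the start of the segment
    end_rcp: tuple    # register checkpoint taken at the end of the segment
    instrs: list      # instructions committed within the segment
    lsl: list         # load-store log entries: (kind, addr, value)

def run_big_core(program, memory, seg_len, fault_at=None):
    """Execute `program` and cut it into segments of `seg_len` instructions.

    `fault_at` optionally flips one bit in the result of the addi at that index.
    """
    regs = [0] * 4
    segments = []
    for base in range(0, len(program), seg_len):
        chunk = program[base:base + seg_len]
        start, lsl = tuple(regs), []
        for offset, (op, a, b, c) in enumerate(chunk):
            if op == "addi":
                regs[a] = regs[b] + c
                if fault_at == base + offset:
                    regs[a] ^= 1  # injected single-bit fault
            elif op == "load":
                regs[a] = memory[b]
                lsl.append(("load", b, regs[a]))
            elif op == "store":
                memory[b] = regs[a]
                lsl.append(("store", b, regs[a]))
        segments.append(Segment(start, tuple(regs), chunk, lsl))
    return segments

def check_segment(seg):
    """Replay one segment from its start RCP; return True if nothing mismatches."""
    regs = list(seg.start_rcp)
    log = iter(seg.lsl)
    for op, a, b, c in seg.instrs:
        if op == "addi":
            regs[a] = regs[b] + c
        elif op == "load":
            kind, addr, value = next(log)
            if (kind, addr) != ("load", b):
                return False
            regs[a] = value  # loads are served from the log, not from memory
        elif op == "store":
            if next(log) != ("store", b, regs[a]):
                return False
    return tuple(regs) == seg.end_rcp

PROGRAM = [
    ("load", 0, 0, 0), ("addi", 1, 0, 3),   # segment 0
    ("store", 1, 1, 0), ("addi", 2, 1, 1),  # segment 1
    ("load", 3, 1, 0), ("addi", 3, 3, 2),   # segment 2
]

clean = run_big_core(PROGRAM, [5, 0], seg_len=2)
faulty = run_big_core(PROGRAM, [5, 0], seg_len=2, fault_at=3)
print([check_segment(s) for s in clean])   # [True, True, True]
print([check_segment(s) for s in faulty])  # [True, False, True]
```

Because each segment carries its own start checkpoint and log, the three checks are independent of one another, which is what allows them to be distributed across several little cores. In this model they simply run one after another. The fault is reported only for the segment in which it occurred, since later segments replay from the checkpoint the big core actually produced.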