FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems

Tinglue Wang, Yiming Li, Wei Tang, Jiapeng Guan, Zhenghui Guo, Renshuang Jiang, Ran Wei, Jing Li, Zhe Jiang

TL;DR

FlexStep addresses the inefficiencies of rigid LockStep error detection in contemporary multi-/many-core real-time systems by decoupling reliability checks from fixed core bindings and enabling asynchronous, preemptive verification across configurable cores. It delivers a hardware-software co-design with a novel microarchitecture (RCPM, MAL, DBC) plus a customised ISA and OS scheduling to support dynamic verification and task preemption, backed by a formal scheduling model using partitioned EDF with virtual deadlines. Empirical evaluation on FPGA/Chipyard shows microsecond-scale detection latency, minimal slowdowns (around 1%), and modest area/power overheads, outperforming LockStep and HMR in schedulability and scalability. The work offers a practical path to flexible and efficient reliability management in safety-critical systems and provides open-source FlexStep code for broader adoption.

Abstract

Reliability and real-time responsiveness in safety-critical systems have traditionally been achieved using error detection mechanisms, such as LockStep, which require pre-configured checker cores, strict synchronisation between main and checker cores, or static error detection regions, or which offer only limited preemption capabilities. However, these core-bound hardware mechanisms often lead to significant resource over-provisioning and diminished real-time responsiveness, particularly in modern systems where tasks with varying reliability requirements are consolidated on shared processors to improve efficiency, reduce costs, and save power. To address these challenges, this work presents FlexStep, a systematic solution that integrates hardware and software across the SoC, ISA, and OS scheduling layers. FlexStep features a novel microarchitecture that supports dynamic core configuration and asynchronous, preemptive error detection. The FlexStep architecture naturally allows for flexible task scheduling and error detection, enabling new scheduling algorithms that enhance both resource efficiency and real-time schedulability. We publicly release FlexStep's source code at https://anonymous.4open.science/r/FlexStep-DAC25-7B0C.

Paper Structure

This paper contains 18 sections, 8 figures, 3 tables, 3 algorithms.

Figures (8)

  • Figure 1: Scheduling on different dual-core architectures. Tasks $\tau_1, \tau_2, \tau_3$ have implicit deadlines and worst-case execution times (WCETs) of 15, 15, and 5, respectively. $\tau_1$ and $\tau_3$ are non-verification tasks that do not require error checking. An emergency occurs upon the arrival of the first job of task $\tau_2$, requiring its first 10 units of work to be checked for errors.
  • Figure 2: FlexStep overview. At the hardware level (shown in the red box), colored modules represent functional units added by FlexStep: orange hues denote identical functionality, while yellow and blue hues highlight variations in functionality for reconfigured main and checker cores, respectively. (a) Register Checkpoint Management (RCPM) manages register checkpoints and instruction counts, providing checker cores with execution boundaries and snapshots for verification (Sec. \ref{RCPM}). (b) Memory Access Log (MAL) tracks and records memory accesses for correctness checks (Sec. \ref{MAL}). (c) Data Buffering and Channelling (DBC) manages asynchronous data buffering and communication between cores via the system interconnect (Sec. \ref{DBC}). At the software level (shown in the green box), FlexStep provides (d) a customised ISA (Tab. \ref{table:ISA}) to support the control interface between the hardware microarchitecture and the OS, enabling (e) a control flow capable of context switching between verification and non-verification tasks and more flexible scheduling.
  • Figure 3: Checking segments: their components, their execution on the main and checker cores, and segments of default length versus segments interrupted by the kernel.
  • Figure 4: Performance slowdown of PARSEC and SPEC06 benchmarks using LockStep, FlexStep, and nZDC.
  • Figure 5: Percentage of schedulable task sets ($y$-axis) under LockStep, HMR, and FlexStep with increasing task set utilisations ($x$-axis) and varying system configurations: $m$ (number of cores), $n$ (number of tasks), $\alpha$ (percentage of double-check tasks), and $\beta$ (percentage of triple-check tasks).
  • ...and 3 more figures
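
The roles that the Figure 2 caption assigns to register checkpoints, instruction counts, and the memory access log can be illustrated with a small software model. The sketch below is not FlexStep's hardware or ISA: the three-instruction toy machine, the `Segment` record, and the functions `run_main` and `check_segment` are all hypothetical names introduced here. It also assumes that the checker serves loads from the recorded log rather than from memory, which the caption does not state.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Registers = Dict[str, int]
MemAccess = Tuple[str, int, int]   # (kind, address, value), kind is "load" or "store"
Instr = Tuple[str, str, int]       # (opcode, register, immediate or address)


@dataclass
class Segment:
    """Hypothetical record of one checking segment (assumed layout)."""
    start_regs: Registers      # register checkpoint at segment start
    end_regs: Registers        # register checkpoint at segment end
    instr_count: int           # execution boundary for the checker
    mem_log: List[MemAccess]   # memory accesses recorded by the main core


def run_main(program: List[Instr], regs: Registers, memory: Dict[int, int]) -> Segment:
    """Execute on the modelled main core and record a segment for later checking."""
    start = dict(regs)
    log: List[MemAccess] = []
    for op, reg, arg in program:
        if op == "addi":        # regs[reg] += immediate
            regs[reg] += arg
        elif op == "load":      # regs[reg] = memory[address]
            regs[reg] = memory[arg]
            log.append(("load", arg, regs[reg]))
        elif op == "store":     # memory[address] = regs[reg]
            memory[arg] = regs[reg]
            log.append(("store", arg, regs[reg]))
    return Segment(start, dict(regs), len(program), log)


def check_segment(program: List[Instr], seg: Segment) -> bool:
    """Replay on the modelled checker core; return True if no mismatch is found."""
    regs = dict(seg.start_regs)
    log = iter(seg.mem_log)
    for op, reg, arg in program[:seg.instr_count]:
        if op == "addi":
            regs[reg] += arg
        elif op == "load":
            entry = next(log, None)
            if entry is None or entry[:2] != ("load", arg):
                return False
            regs[reg] = entry[2]    # assumption: loads are served from the log
        elif op == "store":
            if next(log, None) != ("store", arg, regs[reg]):
                return False
    return regs == seg.end_regs     # compare against the end checkpoint


program = [("load", "x1", 0), ("addi", "x1", 5), ("store", "x1", 1)]
memory = {0: 10, 1: 0}
segment = run_main(program, {"x1": 0}, memory)
ok = check_segment(program, segment)

# Model a fault on the main core by corrupting the recorded end checkpoint.
faulty = Segment(segment.start_regs, {"x1": 14}, segment.instr_count, segment.mem_log)
detected = not check_segment(program, faulty)
```

Because the checker in this model needs only the recorded segment, it can run after the main core has moved on, which is the sense in which the verification is asynchronous here.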