FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems
Tinglue Wang, Yiming Li, Wei Tang, Jiapeng Guan, Zhenghui Guo, Renshuang Jiang, Ran Wei, Jing Li, Zhe Jiang
TL;DR
FlexStep addresses the inefficiencies of rigid LockStep error detection in multi-/many-core real-time systems by decoupling reliability checks from fixed core bindings and enabling asynchronous, preemptive verification across configurable cores. It is a hardware-software co-design that combines a novel microarchitecture (RCPM, MAL, DBC) with a customised ISA and OS scheduling support for dynamic verification and task preemption, backed by a formal scheduling model based on partitioned EDF with virtual deadlines. Evaluation on FPGA/Chipyard shows microsecond-scale detection latency, minimal slowdown (around 1%), and modest area and power overheads, and FlexStep outperforms LockStep and HMR in schedulability and scalability. The work offers a practical path to flexible and efficient reliability management in safety-critical systems, and the FlexStep source code is released as open source for broader adoption.
Abstract
Reliability and real-time responsiveness in safety-critical systems have traditionally been achieved using error detection mechanisms such as LockStep, which require pre-configured checker cores, strict synchronisation between main and checker cores, or static error detection regions, or which offer only limited preemption capabilities. However, these core-bound hardware mechanisms often lead to significant resource over-provisioning and diminished real-time responsiveness, particularly in modern systems where tasks with varying reliability requirements are consolidated on shared processors to improve efficiency, reduce costs, and save power. To address these challenges, this work presents FlexStep, a systematic solution that integrates hardware and software across the SoC, ISA, and OS scheduling layers. FlexStep features a novel microarchitecture that supports dynamic core configuration and asynchronous, preemptive error detection. The FlexStep architecture naturally allows for flexible task scheduling and error detection, enabling new scheduling algorithms that enhance both resource efficiency and real-time schedulability. We publicly release FlexStep's source code at https://anonymous.4open.science/r/FlexStep-DAC25-7B0C.
