MESC: Re-thinking Algorithmic Priority and/or Criticality Inversions for Heterogeneous MCSs

Jiapeng Guan, Ran Wei, Dean You, Yingquan Wang, Ruizhe Yang, Hui Wang, Zhe Jiang

TL;DR

MESC targets real-time predictability in heterogeneous MCSs by introducing instruction-level preemption for DNN accelerators, which mitigates algorithmic priority and criticality inversions. It comprises Gemmini_rt, a micro-architecture with a default config channel, a config-copy buffer, and an address remapper, plus an OS-level task monitor and scheduler, all integrated with a theoretical worst-case response time (WCRT) model and a local-memory allocation strategy. Empirical results on an FPGA-based setup show 250x and 300x speedups in resolving priority and criticality inversions, respectively, with modest hardware overhead (~5%). The framework offers a practical, end-to-end solution for fine-grained accelerator preemption that preserves data/config consistency and sustains high schedulability in realistic MCS workloads.

Abstract

Modern Mixed-Criticality Systems (MCSs) rely on hardware heterogeneity to satisfy ever-increasing computational demands. However, most heterogeneous co-processors are designed for high throughput, with micro-architectures that execute workloads in a streaming manner. This streaming execution is often non-preemptive or limited-preemptive, which prevents tasks from being prioritised by importance and results in frequent algorithmic priority and/or criticality inversions. Such problems present a significant barrier to guaranteeing the systems' real-time predictability, especially when co-processors dominate the execution of the workloads (e.g., DNNs and transformers). In contrast to existing works, which typically enable coarse-grained context switches by splitting the workloads/algorithms, we demonstrate a method that provides fine-grained context switches on a widely used open-source DNN accelerator by enabling instruction-level preemption without any modification to the workloads or algorithms. As a systematic solution, we build a real system, i.e., Make Each Switch Count (MESC), from the SoC and ISA to the OS kernel. A theoretical model and analysis are also provided for timing guarantees. Experimental results reveal that, compared to conventional MCSs using non-preemptive DNN accelerators, MESC achieves 250x and 300x speedups in resolving algorithmic priority and criticality inversions, respectively, with less than 5% overhead. To our knowledge, this is the first work investigating algorithmic priority and criticality inversions for MCSs at the instruction level.
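
To make the inversion problem concrete, the following is a minimal toy model of how long a newly released high-priority task might wait for an accelerator that is busy with a lower-priority workload. It is an illustrative sketch, not the paper's WCRT analysis: the function `blocking_cycles`, its parameters, and the cycle counts in the usage lines are all hypothetical.

```python
def blocking_cycles(remaining_workload: int,
                    remaining_instruction: int,
                    switch_overhead: int,
                    mode: str) -> int:
    """Cycles a high-priority task waits before it can use the accelerator.

    All parameters are hypothetical quantities for illustration:
      remaining_workload    - cycles left in the running low-priority workload
      remaining_instruction - cycles left in the instruction currently executing
      switch_overhead       - assumed cost of saving/restoring accelerator state
    """
    if mode == "non_preemptive":
        # The running workload must finish before anything else can start.
        return remaining_workload
    if mode == "instruction_level":
        # Preempt at the next instruction boundary and pay the switch cost,
        # unless the workload would finish sooner anyway.
        return min(remaining_workload, remaining_instruction + switch_overhead)
    raise ValueError(f"unknown preemption mode: {mode}")


# Hypothetical numbers, chosen only to show the shape of the difference.
wait_np = blocking_cycles(1_000_000, 500, 1_500, "non_preemptive")
wait_il = blocking_cycles(1_000_000, 500, 1_500, "instruction_level")
```

In this toy model the non-preemptive wait grows with the size of the low-priority workload, whereas the instruction-level wait is bounded by one instruction plus the assumed switch cost.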

Paper Structure (29 sections, 11 equations, 10 figures, 3 tables, 1 algorithm)

Figures (10)

  • Figure 1: Scheduling with DNN accelerators (referred to as ACC in figures) featuring different preemption characteristics: (a) non-preemption, (b) limited preemption, and (c) instruction-level preemption. $\tau_i(P_i, L_i)$ represents task $\tau_i$ with priority $P_i$ and criticality $L_i$, with smaller $P_i$ indicating a higher priority.
  • Figure 2: Execution cycles of Gemmini running workloads of varying sizes. The workloads are categorised based on their execution times. Small workloads: $[0, 1~\text{million}]$ cycles; medium workloads: $(1~\text{million}, 10~\text{million}]$ cycles; large workloads: $(10~\text{million}, 1~\text{billion}]$ cycles.
  • Figure 3: Architectural overview: the blue box represents the software level, the orange and green boxes correspond to the NPU and CPU (hardware), respectively.
  • Figure 4: The default configuration pathway for load class instructions (shown on the left, in black), and the process utilising step_wise_mvin and step_wise_mvout to transfer matrices into or out of the scratchpad and accumulator (right side).
  • Figure 5: Micro-architecture of the address remapper, featuring the signal laddr for the local memory address to be written or read, and local_addr representing the address calculation rules within the accelerator. Blue lines: the path of the read signals; grey lines: the path of the write signals; black lines: the common pathways.
  • ...and 5 more figures
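
The workload categories in the Figure 2 caption can be sketched as a small classifier. The interval bounds are taken from the caption; the function name `classify_workload` and the choice to raise an error outside the stated range are assumptions made for illustration.

```python
MILLION = 1_000_000
BILLION = 1_000_000_000


def classify_workload(cycles: int) -> str:
    """Map an execution time in cycles to the Figure 2 size category."""
    if 0 <= cycles <= 1 * MILLION:
        return "small"    # [0, 1 million]
    if 1 * MILLION < cycles <= 10 * MILLION:
        return "medium"   # (1 million, 10 million]
    if 10 * MILLION < cycles <= 1 * BILLION:
        return "large"    # (10 million, 1 billion]
    # The caption defines no category outside these intervals.
    raise ValueError(f"cycle count outside the categorised range: {cycles}")
```

Note that the intervals are closed on the right, so a workload of exactly 1 million cycles counts as small and one of exactly 10 million cycles counts as medium.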