MERE: Hardware-Software Co-Design for Masking Cache Miss Latency in Embedded Processors

Dean You, Jieyu Jiang, Xiaoxuan Wang, Yushu Du, Zhihang Tan, Wenbo Xu, Hui Wang, Jiapeng Guan, Zhenyuan Wang, Ran Wei, Shuai Zhao, Zhe Jiang

TL;DR

This paper tackles the latency penalties caused by irregular memory accesses on embedded scalar in-order cores. It introduces MERE, a full-stack hardware-software co-design that reconstructs sequential runahead for scalar cores and adds an adaptive runahead layer to mitigate cache contention. The approach includes a dedicated runahead control unit, checkpoint/release circuits, a compact runahead cache, and a customised ISA that interfaces with the OS, enabling efficient operation with minimal area and power overhead. FPGA-based evaluation shows MERE reaches about 93.5% of a 2-wide OoO core's performance with area and power overheads under 5%, while the adaptive runahead adds a further ~20% performance gain, which makes it practical for embedded systems running irregular workloads.

Abstract

Runahead execution is a technique to mask memory latency caused by irregular memory accesses. By pre-executing the application code during long-latency operations and prefetching the data of anticipated cache misses into the cache hierarchy, runahead effectively masks memory latency for subsequent cache misses and achieves high prefetching accuracy; however, this technique has been limited to superscalar out-of-order and superscalar in-order cores. For implementation in scalar in-order cores, the challenges of area and energy constraints and severe cache contention remain. Here, we build the first full-stack system featuring runahead, MERE, from the SoC and a dedicated ISA to the OS and programming model. Through this deployment, we show that enabling runahead in scalar in-order cores is possible, with minimal area and power overheads, while still achieving high performance. By reconstructing sequential runahead with a hardware/software co-design approach, the system can be implemented on a mature processor and SoC. Building on this, an adaptive runahead mechanism is proposed to mitigate the severe cache contention in scalar in-order cores. Combining these, we provide a comprehensive solution for embedded processors managing irregular workloads. Our evaluation demonstrates that the proposed MERE attains 93.5% of a 2-wide out-of-order core's performance while constraining area and power overheads below 5%, with the adaptive runahead mechanism delivering an additional 20.1% performance gain by mitigating the severe cache contention.
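The basic idea of runahead described in the abstract can be illustrated with a toy cycle-count model. This is only a sketch under strong assumptions, not MERE's mechanism: the function `simulate`, the boolean `trace`, and the single-level cache with a fixed `miss_latency` are all hypothetical, every instruction costs one cycle, misses are independent of each other, prefetched data is assumed to arrive just in time, and cache contention (the problem the adaptive mechanism targets) is ignored.

```python
def simulate(trace, miss_latency, runahead):
    """Count cycles for a scalar in-order core on a toy access trace.

    trace[i] is True if access i would miss the cache without prefetching.
    All names and the cost model are illustrative assumptions.
    """
    prefetched = set()  # indices whose data was fetched during runahead
    cycles = 0
    for i, is_miss in enumerate(trace):
        cycles += 1  # one cycle to execute the instruction itself
        if is_miss and i not in prefetched:
            cycles += miss_latency  # the in-order core stalls on the miss
            if runahead:
                # While stalled, pre-execute one later instruction per
                # stall cycle and prefetch any miss found in that window.
                end = min(len(trace), i + 1 + miss_latency)
                prefetched.update(j for j in range(i + 1, end) if trace[j])
    return cycles


trace = [True, False, True, False, False, True]
print(simulate(trace, 4, runahead=False))  # 18
print(simulate(trace, 4, runahead=True))   # 14
```

In this trace the miss at index 2 falls inside the stall window of the miss at index 0, so runahead turns it into a hit; the miss at index 5 lies beyond the window and still stalls the core.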

Paper Structure

This paper contains 25 sections, 3 equations, 20 figures, 4 tables, 3 algorithms.

Figures (20)

  • Figure 1: MERE reconstructs the architecture of sequential runahead in both software and hardware. In the software view, miss number 4 conflicts with miss number 1; miss number 5 is an L1 miss, and miss numbers 1-4 and 6 are L2 misses. (SRH/ERH: Start/End Runahead; RCU: Runahead Control Unit; MC-CP: Multi-Cycle-CheckPoint; PMU: Prefetch Management Unit.)
  • Figure 2: Average speedup (Norm. Performance) and whole-system power and area overheads for MERE versus a scalar in-order baseline, a stream prefetcher, and an out-of-order core.
  • Figure 3: Conflict prefetch. (The runahead prefetch queue is on the left and the cache state checkpoint is on the right; load4 accesses data3, which will be used later.)
  • Figure 4: The process of indirect memory access.
  • Figure 5: Three cases of unbeneficial runahead. (INV $M_1$ means the load of miss number 1 is an invalid instruction; the definition of an invalid instruction is in Sec. \ref{sc:RC2U}.)
  • ...and 15 more figures