Constable: Improving Performance and Power Efficiency by Safely Eliminating Load Instruction Execution

Rahul Bera, Adithya Ranganathan, Joydeep Rakshit, Sujit Mahto, Anant V. Nori, Jayesh Gaur, Ataberk Olgun, Konstantinos Kanellopoulos, Mohammad Sadrosadati, Sreenivas Subramoney, Onur Mutlu

TL;DR

The paper proposes Constable, a purely microarchitectural technique that safely eliminates the execution of load instructions. It improves performance while reducing core dynamic power consumption over a strong baseline system that implements MRN and other dynamic instruction optimizations.

Abstract

Load instructions often limit instruction-level parallelism (ILP) in modern processors due to the data and resource dependences they cause. Prior techniques like Load Value Prediction (LVP) and Memory Renaming (MRN) mitigate load data dependence by predicting the data value of a load instruction. However, they fail to mitigate load resource dependence, as the predicted load instruction gets executed nonetheless. Our goal in this work is to improve ILP by mitigating both load data dependence and resource dependence. To this end, we propose a purely-microarchitectural technique called Constable that safely eliminates the execution of load instructions. Constable dynamically identifies load instructions that have repeatedly fetched the same data from the same load address. We call such loads likely-stable. For every likely-stable load, Constable (1) tracks modifications to its source architectural registers and memory location via lightweight hardware structures, and (2) eliminates the execution of subsequent instances of the load instruction until there is a write to its source register or a store or snoop request to its load address. Our extensive evaluation using a wide variety of 90 workloads shows that Constable improves performance by 5.1% while reducing the core dynamic power consumption by 3.4% on average over a strong baseline system that implements MRN and other dynamic instruction optimizations (e.g., move and zero elimination, constant and branch folding). In the presence of 2-way simultaneous multithreading (SMT), Constable's performance improvement increases to 8.8% over the baseline system. When combined with a state-of-the-art load value predictor (EVES), Constable provides an additional 3.7% and 7.8% average performance benefit over the load value predictor alone, in the baseline system without and with 2-way SMT, respectively.
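
The mechanism described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class name `StableLoadTracker`, the `Entry` fields, the confidence threshold of 3, and the choice to reset confidence on any invalidating event are all hypothetical. It shows the two steps the abstract names: marking a load as likely-stable after it repeatedly fetches the same data from the same address, and then skipping its execution until a source register is written or its address sees a store or snoop.

```python
from dataclasses import dataclass

CONF_THRESHOLD = 3  # hypothetical threshold; not taken from the paper


@dataclass
class Entry:
    src_regs: frozenset   # source architectural registers of the load
    addr: int             # last observed load address
    value: int            # last observed load data
    confidence: int = 0   # consecutive repeats of the same (addr, value)


class StableLoadTracker:
    """Toy model of likely-stable load tracking (illustrative only)."""

    def __init__(self, threshold=CONF_THRESHOLD):
        self.threshold = threshold
        self.table = {}   # load PC -> Entry

    def on_load(self, pc, src_regs, execute):
        """Return (value, executed).

        `execute()` performs the real load and returns (addr, value).
        It is not called when the load is eliminated.
        """
        e = self.table.get(pc)
        if e is not None and e.confidence >= self.threshold:
            return e.value, False          # likely-stable: skip execution
        addr, value = execute()            # execute the load normally
        if e is not None and (e.addr, e.value) == (addr, value):
            e.confidence += 1              # same data from the same address
        else:
            self.table[pc] = Entry(frozenset(src_regs), addr, value)
        return value, True

    def on_register_write(self, reg):
        # A write to a source register may change the load address.
        for e in self.table.values():
            if reg in e.src_regs:
                e.confidence = 0

    def on_store_or_snoop(self, addr):
        # A store or snoop to the load address may change the data.
        for e in self.table.values():
            if e.addr == addr:
                e.confidence = 0
```

In this model, a load at a fixed PC is executed until its confidence reaches the threshold, after which `on_load` returns the recorded value without calling `execute`. Calling `on_register_write` or `on_store_or_snoop` for a tracked register or address forces the next instance to execute again. A real implementation would use bounded hardware tables rather than an unbounded dictionary.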

Paper Structure

This paper contains 55 sections, 23 figures, and 4 tables.

Figures (23)

  • Figure 1: Two component operations of a load instruction execution and their associated pipeline resources.
  • Figure 2: Execution timeline of a code example in a processor (a) without a load value predictor (LVP), (b) with LVP, and (c) with LVP and load elimination.
  • Figure 3: (a) Fraction of dynamic loads that are global-stable. Distribution of global-stable loads by their (b) addressing mode and (c) inter-occurrence distance. (d) Distribution of inter-occurrence distance of global-stable loads from each addressing mode.
  • Figure 5: Code example and disassembly from 541.leela_r and 557.xz_r of SPEC CPU 2017 suite.
  • Figure 6: (a) Fraction of total execution cycles where at least one load port is utilized (we call such cycles load-utilized). (b) Categorization of load-utilized cycles based on whether or not a global-stable load utilizes a load port.
  • ...and 18 more figures