Instruction Block Movement with Coupled High-Level Program Sequencing

Shyam Murthy, Gurindar S. Sohi

TL;DR

The paper targets processor front-end inefficiency by proposing an Instruction Presending Unit (IPU), which uses a shadow representation of the program to move instruction cache blocks and iTLB/BTB entries from secondary to primary structures just in time. The hardware shadow model consists of a Fragment Table (FT), a Dual Target Table (DTT), and an Overflow Regions Table (ORT); together they capture fragments of the high-level call graph and each fragment's potential next fragments, and they guide autonomous presending. The IPU can operate with a single next fragment or with several, handles indirect calls and loops, and uses set-associative FT structures to reduce conflicts while keeping direct-path accesses fast. The evaluation shows that presending reduces misses in the primary structures, in many cases by an order of magnitude relative to state-of-the-art instruction prefetching, lets the processor operate with small primary BTBs without sacrificing performance, and keeps L3 traffic manageable. The approach decouples front-end progress from branch predictor and BTB accuracy, and the results indicate practical potential for front-end acceleration in workloads with large code footprints.
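
To make the idea of just-in-time block movement concrete, here is a minimal toy model in Python. It is only a sketch under loud assumptions and is not the paper's FT/DTT/ORT design: the names `Fragment`, `LRUCache`, `count_misses`, `FRAGMENTS`, and `TRACE` are hypothetical, each fragment records exactly one next fragment, the secondary structure is assumed to always hold the requested block, and timing is ignored. It compares demand fetching with presending the next fragment's blocks into a small primary cache.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fragment:
    """Hypothetical shadow entry: blocks a fragment touches and its likely successor."""
    blocks: List[int]
    next_fragment: Optional[str] = None


class LRUCache:
    """Tiny fully associative LRU cache standing in for a primary structure (e.g. L1i)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines: "OrderedDict[int, None]" = OrderedDict()

    def contains(self, block: int) -> bool:
        return block in self.lines

    def touch(self, block: int) -> None:
        # Insert or refresh a block, then evict least recently used blocks if over capacity.
        self.lines[block] = None
        self.lines.move_to_end(block)
        while len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


def count_misses(trace: List[str], fragments: Dict[str, Fragment],
                 capacity: int, presend: bool) -> int:
    """Count primary-cache misses while executing a sequence of fragments."""
    l1i = LRUCache(capacity)
    misses = 0
    for name in trace:
        frag = fragments[name]
        for block in frag.blocks:
            if not l1i.contains(block):
                misses += 1  # demand miss: block must come from the secondary structure
            l1i.touch(block)
        if presend and frag.next_fragment is not None:
            # Assumption: the recorded next fragment's blocks are moved in before it starts.
            for block in fragments[frag.next_fragment].blocks:
                l1i.touch(block)
    return misses


# Three fragments of four blocks each, executed in a loop; the cache holds only eight blocks.
FRAGMENTS = {
    "A": Fragment([0, 1, 2, 3], "B"),
    "B": Fragment([4, 5, 6, 7], "C"),
    "C": Fragment([8, 9, 10, 11], "A"),
}
TRACE = ["A", "B", "C"] * 3

if __name__ == "__main__":
    print("demand fetch misses:", count_misses(TRACE, FRAGMENTS, 8, presend=False))
    print("presend misses:     ", count_misses(TRACE, FRAGMENTS, 8, presend=True))
```

In this toy setup the 12-block loop thrashes the 8-block LRU cache under demand fetching, so all 36 accesses miss, whereas presending leaves only the 4 cold misses of the first fragment. The numbers illustrate the mechanism only and say nothing about the paper's measured results.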

Abstract

Efficiency in instruction fetching is critical to performance, and this requires the primary structures -- L1 instruction caches (L1i), branch target buffers (BTB) and instruction TLBs (iTLB) -- to have the requisite information when needed. This paper proposes a high-level program sequencing mechanism and a coupled technique for block movement, called instruction presending, in which instruction cache blocks, BTB entries, and iTLB entries are autonomously moved (or sent) from the secondary to the primary structures in a "just in time" fashion so that they are available when needed. Empirical results are presented to demonstrate the efficacy of the high-level sequencing mechanism and block movement. Presending is especially effective for benchmarks with a high base MPKI, where the movement of instruction blocks (and BTB/iTLB entries) from secondary to primary structures is frequent. In many cases, presending reduces the number of misses in primary structures by an order of magnitude compared with state-of-the-art instruction prefetching schemes, while allowing the processor to operate with small primary BTBs. This reduction in misses results in performance improvements in cases where front-end efficiency is important.

Paper Structure (30 sections, 12 figures, 4 tables, 1 algorithm)

Figures (12)

  • Figure 1: Overview of the Instruction Presending Unit (IPU)
  • Figure 2: Entry of a Fragment Table
  • Figure 3: Fragment Table Construction Example
  • Figure 4: RPKI FDIP and Send
  • Figure 5: iTLB MPKI
  • ...and 7 more figures