Integrating Prefetcher Selection with Dynamic Request Allocation Improves Prefetching Efficiency

Mengming Li, Qijun Zhang, Yongqing Ren, Zhiyao Xie

TL;DR

Alecto addresses conflicts in hybrid hardware prefetching by coupling per-PC prefetcher identification with dynamic demand request allocation. It uses three tables (Allocation, Sample, Sandbox) to identify suitable prefetchers, allocate training requests, and filter duplicates, achieving superior accuracy, coverage, and timeliness while reducing prefetcher-table energy and storage overhead. Compared with RL-based Bandit and other baselines, Alecto delivers consistent gains across single- and multi-core workloads and under temporal prefetching scenarios, with notable improvements in memory-intensive benchmarks and robust scalability. This framework enables broader adoption of multi-prefetcher designs by mitigating table conflicts and enabling efficient, fine-grained scheduling of prefetchers with minimal overhead.

Abstract

Hardware prefetching plays a critical role in hiding off-chip DRAM latency. The complexity of applications results in a wide variety of memory access patterns, prompting the development of numerous cache-prefetching algorithms. Consequently, commercial processors often employ a hybrid of these algorithms to enhance overall prefetching performance. Nonetheless, since these prefetchers share hardware resources, conflicts arising from competing prefetching requests can negate the benefits of hardware prefetching. Several prefetcher selection algorithms have therefore been proposed to mitigate conflicts between prefetchers. However, these prior solutions suffer from two limitations. First, the input demand request allocation is inaccurate. Second, the prefetcher selection criteria are coarse-grained. In this paper, we address both limitations by introducing an efficient and widely applicable prefetcher selection algorithm, Alecto, which tailors the demand requests for each prefetcher. Every demand request is first sent to Alecto to identify suitable prefetchers before being routed to those prefetchers for training and prefetching. Our analysis shows that Alecto is adept at not only harmonizing prefetching accuracy, coverage, and timeliness but also significantly enhancing the utilization of the prefetcher table, which is vital for temporal prefetching. Alecto outperforms the state-of-the-art RL-based prefetcher selection algorithm, Bandit, by 2.76% in single-core and 7.56% in eight-core workloads. For memory-intensive benchmarks, Alecto outperforms Bandit by 5.25%. Alecto consistently delivers state-of-the-art performance in scheduling various types of cache prefetchers. In addition to the performance improvement, Alecto can reduce the energy consumption associated with accessing the prefetchers' table by 48%, while adding less than 1 KB of storage overhead.

Paper Structure

This paper contains 35 sections, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Comparison of prefetcher table misses for the same composite prefetchers without dynamic demand request allocation (DDRA) and with Alecto, which utilizes DDRA. With efficient demand request allocation, Alecto significantly reduces the conflicts that occur within the prefetchers’ table.
  • Figure 2: Memory access patterns of 459.GemsFDTD.
  • Figure 3: Comparison of prefetcher selection algorithms. (a) DOL selects prefetchers in the allocation stage. It sequentially passes the demand request through all prefetchers. (b) IPCP selects prefetchers in the prefetch stage. It statically prioritizes the prefetching requests from different prefetchers. (c) RL-based schemes select prefetchers in the prefetch stage. They control the outputs of prefetchers and apply identical rules to all memory accesses. (d) Alecto selects prefetchers in the allocation stage. It identifies suitable prefetchers for each memory access, then dynamically allocates demand requests to the identified prefetchers.
  • Figure 4: The overall framework of Alecto. It consists of an Allocation Table, which enables fine-grained prefetcher identification and dynamic request allocation. It also includes a Sample Table and Sandbox Table for information collection. Additionally, the Sandbox Table functions as a prefetch filter.
  • Figure 5: The state machine of Allocation Table. For every memory access instruction, each prefetcher has three states: Un-Identified (UI) indicates the suitability of this prefetcher is unidentified; Identified and Aggressive (IA) means the prefetcher is efficient and its prefetching degree should be promoted; Identified and Blocked (IB) applies when a prefetcher is deemed unsuitable for processing the memory access instructions.
  • ...and 15 more figures
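
The Allocation Table described in the captions for Figures 4 and 5 can be pictured as a per-PC map from each prefetcher to one of three states (UI, IA, IB). The Python sketch below is only an illustration of that idea under stated assumptions, not the authors' design. The accuracy thresholds, the transition rule, the choice to keep sending requests to UI prefetchers, and the prefetcher names are all hypothetical. The captions specify only the three states and that requests are allocated to identified prefetchers.

```python
from collections import defaultdict
from enum import Enum


class State(Enum):
    UI = "un-identified"              # suitability not yet known
    IA = "identified-and-aggressive"  # efficient; degree may be promoted
    IB = "identified-and-blocked"     # deemed unsuitable for this PC


# Assumed accuracy cut-offs, chosen only for illustration.
PROMOTE_AT = 0.75
BLOCK_AT = 0.25


class AllocationTable:
    """Hypothetical per-PC table holding one state per prefetcher."""

    def __init__(self, prefetchers):
        self.prefetchers = tuple(prefetchers)
        # PC -> {prefetcher name -> State}; unseen PCs start un-identified.
        self.entries = defaultdict(
            lambda: {p: State.UI for p in self.prefetchers}
        )

    def allocate(self, pc):
        """Return the prefetchers that receive a demand request from pc.

        Assumption: only blocked (IB) prefetchers are skipped, so UI
        prefetchers still see requests and can be evaluated.
        """
        return [p for p, s in self.entries[pc].items() if s is not State.IB]

    def update(self, pc, prefetcher, accuracy):
        """Set the state of one (pc, prefetcher) pair from measured accuracy."""
        if accuracy >= PROMOTE_AT:
            new_state = State.IA
        elif accuracy <= BLOCK_AT:
            new_state = State.IB
        else:
            new_state = State.UI
        self.entries[pc][prefetcher] = new_state
        return new_state


# Usage: block a (hypothetical) stream prefetcher for one load instruction.
table = AllocationTable(["stride", "stream", "temporal"])
table.update(0x400A10, "stream", 0.10)
allocated = table.allocate(0x400A10)  # "stream" is no longer included
```

Keying the state on the PC rather than on the whole program is what makes the selection fine-grained. A prefetcher blocked for one instruction still receives requests from every other instruction, and a blocked prefetcher never spends table capacity on requests it handles poorly.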