A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering

Alexandre Valentin Jamet; Georgios Vavouliotis; Daniel A. Jiménez; Lluc Alvarez; Marc Casas

TL;DR

The paper targets memory-system bottlenecks in workloads with large data footprints by introducing the Two Level Perceptron (TLP) predictor, a hardware mechanism that unifies off-chip prediction and L1D prefetch filtering using two connected perceptrons, the First Level Predictor (FLP) and the Second Level Predictor (SLP), with a compact 7 KB per-core storage footprint. FLP predicts off-chip accesses from virtual-address features and a selective delay component; SLP uses physical addresses and the FLP prediction as features to filter L1D prefetches. Relative to a baseline with state-of-the-art cache prefetchers but no off-chip prediction, TLP reduces average DRAM transactions by 30.7% (single-core) and 17.7% (multi-core) and achieves geometric-mean speedups of 6.2% and 11.8%, respectively, across workloads that include the GAP graph-processing suite. TLP outperforms a state-of-the-art off-chip predictor (Hermes) and prefetch filter (PPF), and it remains effective across different L1D prefetchers.

Abstract

To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on virtual addresses and a novel selective delay component. The novelty of SLP lies in leveraging off-chip prediction to drive L1D prefetch filtering by using physical addresses and the FLP prediction as features. TLP constitutes the first hardware proposal targeting both off-chip prediction and prefetch filtering using a multi-level perceptron hardware approach. TLP requires only 7 KB of storage. To demonstrate the benefits of TLP, we compare its performance with state-of-the-art approaches using off-chip prediction and prefetch filtering on a wide range of single-core and multi-core workloads. Our experiments show that TLP reduces the average DRAM transactions by 30.7% and 17.7% across the single-core and multi-core workloads, respectively, compared to a baseline that uses state-of-the-art cache prefetchers but no off-chip prediction mechanism, while recent work significantly increases DRAM transactions. As a result, TLP achieves geometric mean performance speedups of 6.2% and 11.8% across single-core and multi-core workloads, respectively. In addition, our evaluation demonstrates that TLP is effective independently of the L1D prefetching logic.
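
The two-level structure described in the abstract can be illustrated with a minimal software sketch. This is not the authors' hardware design: the table size, weight range, training threshold, the specific features, and every class and function name below are assumptions chosen for illustration. The only properties taken from the abstract are that FLP uses virtual-address-based features, that SLP uses physical-address features plus the FLP prediction, and that the SLP output drives L1D prefetch filtering. The selective delay component is not modeled, and the SLP training label is left to the caller because the abstract does not specify it.

```python
TABLE_SIZE = 1024          # assumed number of weights per feature table
W_MIN, W_MAX = -16, 15     # assumed 5-bit saturating weight range


class HashedPerceptron:
    """One table of small signed weights per feature; prediction = sign of sum."""

    def __init__(self, num_features, train_threshold=8):
        self.tables = [[0] * TABLE_SIZE for _ in range(num_features)]
        self.train_threshold = train_threshold  # assumed value

    def _total(self, features):
        return sum(t[f % TABLE_SIZE] for t, f in zip(self.tables, features))

    def predict(self, features):
        return self._total(features) > 0

    def train(self, features, outcome):
        total = self._total(features)
        # Update on a misprediction or when the sum is below the threshold.
        if (total > 0) != outcome or abs(total) < self.train_threshold:
            step = 1 if outcome else -1
            for t, f in zip(self.tables, features):
                i = f % TABLE_SIZE
                t[i] = max(W_MIN, min(W_MAX, t[i] + step))


class TwoLevelSketch:
    """Hypothetical FLP -> SLP chain; feature choices are illustrative only."""

    def __init__(self):
        self.flp = HashedPerceptron(num_features=3)
        self.slp = HashedPerceptron(num_features=3)

    @staticmethod
    def flp_features(pc, vaddr):
        # Assumed virtual-address features: PC, cache line, PC xor page.
        return [pc, vaddr >> 6, pc ^ (vaddr >> 12)]

    @staticmethod
    def slp_features(pc, paddr, flp_off_chip):
        # Assumed physical-address features, plus the FLP prediction bit.
        return [paddr >> 6, pc ^ (paddr >> 12), (pc << 1) | int(flp_off_chip)]

    def predict_off_chip(self, pc, vaddr):
        return self.flp.predict(self.flp_features(pc, vaddr))

    def should_filter_prefetch(self, pc, vaddr, paddr):
        flp_bit = self.predict_off_chip(pc, vaddr)
        return self.slp.predict(self.slp_features(pc, paddr, flp_bit))

    def train_flp(self, pc, vaddr, went_off_chip):
        self.flp.train(self.flp_features(pc, vaddr), went_off_chip)

    def train_slp(self, pc, paddr, flp_bit, should_filter):
        # The label `should_filter` is supplied by the caller (an assumption).
        self.slp.train(self.slp_features(pc, paddr, flp_bit), should_filter)
```

In this sketch the FLP prediction enters SLP as one more hashed feature, so SLP can learn different filtering behavior for the same address depending on whether the first level expects the access to go off-chip.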

Paper Structure (41 sections, 17 figures, 5 tables)

Introduction
Background
Off-Chip Prediction
Prefetch Filtering
Motivation
Cache Behavior of Modern Workloads
Impact of Hermes
DRAM Transactions
Analysis of Hermes Predictions
Off-Chip Prediction for L1D Prefetch Filtering
Two Level Perceptron Prediction
First Level Perceptron (FLP) Predictor
Second Level Perceptron (SLP) Predictor
Building a Multi-Level Perceptron Predictor
TLP Hardware Requirements and Latency
...and 26 more sections

Figures (17)

Figure 1: MPKI of all caches (L1D, L2C, LLC) across the SPEC (SPEC CPU 2006 and SPEC CPU 2017) and GAP workloads.
Figure 2: Increase in DRAM transactions due to Hermes off-chip predictions relative to a baseline without off-chip prediction mechanism. Lower is better.
Figure 3: Increase in DRAM transactions due to Hermes off-chip predictions relative to a baseline without off-chip prediction mechanism in the 4-core context. The x-axis ticks represent 200 different 4-core workload mixes of SPEC and GAP workloads. Lower is better.
Figure 4: Location of a block upon a Hermes off-chip prediction.
Figure 5: Location where the inaccurate L1D prefetch requests are served across two state-of-the-art L1D prefetchers. Both SPEC and GAP workloads are separately sorted based on LLC MPKI, similar to Figure 2.
...and 12 more figures
