Triangel: A High-Performance, Accurate, Timely On-Chip Temporal Prefetcher

Sam Ainsworth, Lev Mukhanov

TL;DR

This paper introduces Triangel, a prefetcher that extends Triage with novel sampling-based methodologies. These allow it to be aggressive and timely when it can handle the observed long-term patterns, and to avoid inaccurate prefetches when it cannot.

Abstract

Temporal prefetching, where correlated pairs of addresses are logged and replayed on repeat accesses, has recently become viable in commercial designs. Arm's latest processors include Correlating Miss Chaining prefetchers, which store such patterns in a partition of the on-chip cache. However, the state-of-the-art on-chip temporal prefetcher in the literature, Triage, features some design inconsistencies and inaccuracies that pose challenges for practical implementation. We first examine and design fixes for these inconsistencies to produce an implementable baseline. We then introduce Triangel, a prefetcher that extends Triage with novel sampling-based methodologies to allow it to be aggressive and timely when the prefetcher is able to handle observed long-term patterns, and to avoid inaccurate prefetches when less able to do so. Triangel gives a 26.4% speedup compared to a baseline system with a conventional stride prefetcher alone, compared with 9.3% for Triage at degree 1 and 14.2% at degree 4. At the same time Triangel only increases memory traffic by 10% relative to baseline, versus 28.5% for Triage.

Paper Structure

This paper contains 45 sections, 1 equation, 20 figures, and 2 tables.

Figures (20)

  • Figure 1: The basic operation of Triage \cite{Triage-MICRO19, TriageISR-ToC22}. On a cache miss (or tagged prefetch hit), the PC is used to index the training table. The previous address is used as an index to train the Markov history table. The current access is then looked up in the Markov table to generate a prefetch. Not shown: index- and target-compression mechanisms (\ref{ssec:metadataformat}), confidence bits (\ref{ssec:confidence}), partition sizing (\ref{ssec:triagesizing}).
  • Figure 2: Fields in the Markov table \cite{Markov} segment of the cache in our reimplementation of Triage \cite{Triage-MICRO19, TriageISR-ToC22}. Its lookup address is indexed by cache set and sub-set (\ref{ssec:index}), and tagged by an XOR hash of the full address (tag-#). The prefetch target is generated by using the LUT-idx bits as an index into the 1024-entry lookup table; the result is then combined with the Offset and 6 zero bits for cache-line alignment.
  • Figure 3: The structure of Triangel. Like Triage, it tracks per-PC miss sequences $(x,y)$ in the training table, and stores and replays them using a Markov table \cite{Markov} inside a partition of the L3 cache. Triangel adds four new structures: a History Sampler, which randomly samples the training table to observe long-term patterns; a Second-Chance Sampler, to identify inexact sequences that still give accurate prefetches; a Metadata Reuse Buffer, to eliminate duplicate L3 Markov-partition accesses from high-degree prefetches; and a Set Dueller, to choose the partitioning of L3 data cache versus Markov table that optimizes hit rates.
  • Figure 4: An example of the classifications performed by Triangel's samplers. "_" signifies an arbitrary address. For PC 0x42, sampling $(x,y)$ reveals that $x$ is repeated within a region short enough to be stored in our Markov table (ReuseConf). Since $y$ is also accessed following $x$ on $x$'s repeat, the pattern repeats (PatternConf) and thus temporal prefetching is accurate. For 0x63, when e repeats, it is followed by h rather than the f we expect. However, Second-Chance Sampling (\ref{sssec:secondchance}) reveals that we access f nearby, and so a prefetch to f at (_,e) would be used before eviction, despite the imperfect sequence. Note that the Markov table can only store one target per index, so it will store only one of (e,f) or (e,h) at any given point.
  • Figure 5: Fields in the training table, which is indexed and tagged by PC. Bold fields are new to Triangel; others are taken from Triage (assuming Triage uses a saturating counter of the same size as ReuseConf for HawkEye classification \cite{HawkEye}).
  • ...and 15 more figures
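
The train-then-lookup flow described in the Figure 1 caption can be illustrated with a small software model. This is only a sketch under simplifying assumptions, not the authors' implementation: the class name `TriageSketch`, its method `access`, and the example addresses are hypothetical, both tables are unbounded dictionaries, and the compression, confidence, and partition-sizing mechanisms that the caption lists as "not shown" are omitted here as well.

```python
class TriageSketch:
    """Toy model of the training-table / Markov-table flow in Figure 1.

    Hypothetical illustration only: tables are unbounded dicts, with no
    compression, confidence bits, or cache partitioning.
    """

    def __init__(self):
        self.training = {}  # PC -> previous miss address seen for that PC
        self.markov = {}    # address -> address that most recently followed it

    def access(self, pc, addr):
        """Handle one cache miss (or tagged prefetch hit).

        Returns the prefetch target for `addr`, or None if none is known.
        """
        prev = self.training.get(pc)
        if prev is not None:
            # Train: the previous address for this PC indexes the Markov
            # table, which holds only one target per index.
            self.markov[prev] = addr
        self.training[pc] = addr
        # Look up the current access to generate a prefetch.
        return self.markov.get(addr)


# One PC walks the same three cache lines twice; the second pass is
# covered by prefetches learned during the first.
prefetcher = TriageSketch()
trace = [0xA0, 0xB0, 0xC0, 0xA0, 0xB0, 0xC0]
prefetches = [prefetcher.access(0x42, addr) for addr in trace]
print(prefetches)  # [None, None, None, 176, 192, 160]
```

Because each index holds a single target, a later pair such as (e,h) would overwrite an earlier (e,f) in this model, which is the limitation the Figure 4 caption notes.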