Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching

Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna

TL;DR

This work addresses the practicality of neural network-based memory prefetchers by transferring knowledge from an attention model into a hierarchy of tables, replacing matrix multiplications with table lookups during inference. It introduces DART, a table-based prefetcher that eliminates $99.99\%$ of the large model's arithmetic operations and runs inference up to $170\times$ faster, at the cost of a 0.09 drop in F1-score. DART is built with a three-stage pipeline: attention model training, knowledge distillation, and layer-wise tabularization with fine-tuning. The authors develop tabularization kernels based on Product Quantization that convert linear and attention computations into fast lookups, and validate the approach using ChampSim on SPEC workloads, reporting IPC improvements 6.1% higher than the rule-based prefetcher BO, 33.1% higher than TransFetch, and 37.2% higher than Voyager. The key contributions include a novel multi-label KD loss with T-Sigmoid, a greedy Table Configurator that selects table configurations under latency and storage constraints, and a layer-wise fine-tuning strategy that mitigates error accumulation in the tabularized network. Overall, DART shows that NN-based prefetching can become practical by trading a modest loss in accuracy for large gains in speed and hardware friendliness.

Abstract

Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overheads associated with these models result in high inference latency, limiting their feasibility as practical prefetchers. To close the gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of the above approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model. DART accelerates inference by 170x relative to the large model and by 9.4x relative to the distilled model. DART has latency and storage costs comparable to those of the state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC improvement. DART outperforms the state-of-the-art NN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in terms of IPC improvement, primarily due to its low prefetching latency.
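
To make the idea of replacing a matrix multiplication with table lookups concrete, the following is a minimal sketch of generic product quantization applied to a single linear layer. It is not the authors' kernel: the function names (`train_prototypes`, `build_table`, `encode`, `lookup_matmul`), the plain k-means training, and the nearest-prototype encoding are all assumptions made for illustration. The input vector is split into subspaces, each sub-vector is mapped to its nearest learned prototype, and the layer output is approximated by summing precomputed prototype-weight dot products.

```python
import numpy as np

def train_prototypes(X, n_subspaces, n_protos, n_iters=10, seed=0):
    """Learn n_protos prototypes per subspace with plain k-means (illustrative choice)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    assert d % n_subspaces == 0, "input dimension must split evenly into subspaces"
    sub = d // n_subspaces
    protos = np.empty((n_subspaces, n_protos, sub))
    for c in range(n_subspaces):
        Xc = X[:, c * sub:(c + 1) * sub]
        # Initialize centroids from distinct training rows.
        cent = Xc[rng.choice(n, n_protos, replace=False)].copy()
        for _ in range(n_iters):
            dist = ((Xc[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(n_protos):
                members = Xc[assign == k]
                if len(members):  # keep the old centroid if a cluster is empty
                    cent[k] = members.mean(0)
        protos[c] = cent
    return protos

def build_table(protos, W):
    """Precompute table[c, k, :] = protos[c, k] @ W[rows belonging to subspace c]."""
    n_subspaces, _, sub = protos.shape
    W_sub = W.reshape(n_subspaces, sub, -1)  # (subspaces, sub_dim, out_dim)
    return np.einsum("cks,cso->cko", protos, W_sub)

def encode(x, protos):
    """Map each sub-vector of x to the index of its nearest prototype."""
    n_subspaces, _, sub = protos.shape
    xs = x.reshape(n_subspaces, 1, sub)
    return ((xs - protos) ** 2).sum(-1).argmin(1)

def lookup_matmul(x, protos, table):
    """Approximate x @ W with one table lookup per subspace, then a sum."""
    codes = encode(x, protos)
    return table[np.arange(len(codes)), codes].sum(0)

# Usage: approximate a random 8-to-3 linear layer with 4 subspaces of 16 prototypes.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 8))
W = rng.normal(size=(8, 3))
protos = train_prototypes(X_train, n_subspaces=4, n_protos=16)
table = build_table(protos, W)
approx = lookup_matmul(X_train[0], protos, table)
exact = X_train[0] @ W  # reference result for comparison
```

In this sketch the query still computes distances in `encode`, so it is not free of arithmetic; how DART performs encoding is not described in this summary. The approximation is exact whenever every sub-vector of the input coincides with a prototype, and degrades as inputs fall further from their nearest prototypes, which is one way to see why fine-tuning after tabularization helps limit accumulated error.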

Paper Structure (49 sections, 26 equations, 14 figures, 9 tables, 1 algorithm)

Figures (14)

  • Figure 1: Training and query of product quantization.
  • Figure 2: Approach to constructing compact table-based memory access predictor by distilling knowledge from a trained attention-based neural network.
  • Figure 3: Using the proposed approach, we present our LLC prefetcher DART with a table-based predictor distilled from an attention-based memory access prediction model.
  • Figure 4: Training and query of the linear kernel.
  • Figure 5: Training and query of the attention kernel.
  • ...and 9 more figures