NeuroScalar: A Deep Learning Framework for Fast, Accurate, and In-the-Wild Cycle-Level Performance Prediction

Shayne Wadle, Yanxin Zhang, Vikas Singh, Karthikeyan Sankaralingam

TL;DR

NeuroScalar tackles the bottleneck of cycle-level evaluation by training a compact deep learning (DL) model on microarchitecture-independent features to predict the performance of hypothetical designs in the wild, on production silicon. It couples offline high-fidelity training with online, sampling-based inference, and adds a low-power on-chip accelerator (Neutrino) that raises throughput with minimal overhead, enabling scalable hardware A/B testing on real workloads. The approach delivers robust per-instruction latency predictions and strong downstream utility for design space exploration, with substantial gains in speed and energy efficiency while preserving privacy and transparency for end users. Overall, NeuroScalar provides a practical, end-to-end framework that accelerates hardware design cycles through continuous feedback from live workloads.

Abstract

The evaluation of new microprocessor designs is constrained by slow, cycle-accurate simulators that rely on unrepresentative benchmark traces. This paper introduces a novel deep learning (DL) framework for high-fidelity, "in-the-wild" simulation on production hardware. Our core contribution is a DL model trained on microarchitecture-independent features to predict cycle-level performance for hypothetical processor designs. Because its inputs do not depend on the host microarchitecture, the model can be deployed on existing silicon to evaluate future hardware. We propose a complete system featuring a lightweight hardware trace collector and a principled sampling strategy to minimize user impact. This system achieves a simulation speed of 5 MIPS on a commodity GPU while imposing a mere 0.1% performance overhead. Furthermore, our co-designed Neutrino on-chip accelerator improves performance by 85x over the GPU. We demonstrate that this framework enables accurate performance analysis and large-scale hardware A/B testing using real-world applications.

Paper Structure

This paper contains 50 sections, 12 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The NeuroScalar end-to-end workflow, showing the offline training phase performed by the chip designer and the online inference on the end-user's system.
  • Figure 2: GT cycle distributions as a percentage of the total number of instructions, with one row per benchmark.
  • Figure 3: Overall architecture of the proposed LSTM-based cycle predictor.
  • Figure 4: Neutrino Inference Accelerator.
  • Figure 5: Pairwise prediction Acc(round).
  • ...and 2 more figures