Polaris: Multi-Fidelity Design Space Exploration of Deep Learning Accelerators

Chirag Sakhuja, Charles Hong, Calvin Lin

TL;DR

The paper tackles the high cost of exploring deep learning accelerator (DLA) design spaces, which stems from slow high-fidelity evaluations. It introduces Starlight, a predictor built on transfer learning and deep kernel learning that predicts the energy-delay product (EDP) measured by RTL simulation with 99% accuracy while remaining fast, and Polaris, a Bayesian-optimization-based design-space exploration (DSE) tool that uses Starlight in a multi-fidelity, RTL-in-loop setting. Key results: transfer learning cuts the number of training samples needed by about 61% relative to DOSA's predictor; in under 35 minutes, Polaris matches designs that DOSA takes six hours to produce; and in under 3.3 hours, it finds designs with 2.7× lower EDP than DOSA's best. The approach enables rapid hardware/software co-design for DLAs and suggests broader applicability to other hardware design spaces where low- and high-fidelity evaluations are closely related.

Abstract

This paper presents a tool for automatically exploring the design space of deep learning accelerators (DLAs). Our main advancement is Starlight, a data-driven performance model that uses transfer learning to bridge the gap between fast, low-fidelity evaluation methods (such as analytical models) and slow, high-fidelity evaluation methods (such as RTL simulation). Starlight is fast: It can provide 6,500 predictions per second, allowing the evaluation of millions of configurations per hour. Starlight is accurate: It predicts the energy-delay product measured by RTL simulation with 99% accuracy. And Starlight can be trained efficiently: It can be trained with 61% fewer samples than DOSA's state-of-the-art data-driven performance predictor. Our second contribution is Polaris, a design-space exploration tool that uses Starlight to efficiently search the large, complex hardware/software co-design space of DLAs. In under 35 minutes, Polaris produces DLA designs that match the performance of designs that take six hours to produce with DOSA. And in under 3.3 hours, Polaris produces DLA designs that reduce energy-delay product by 2.7× over the best designs found by DOSA.
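
The throughput and time-to-solution figures quoted in the abstract can be checked with simple arithmetic. The sketch below uses only the numbers stated above; the variable and function names are illustrative, and the speedup is a lower bound because the abstract says "under 35 minutes".

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# All names here are illustrative; only the constants come from the text.

PREDICTIONS_PER_SECOND = 6_500   # Starlight throughput stated in the abstract
SECONDS_PER_HOUR = 3_600
DOSA_MINUTES = 6 * 60            # "six hours" for DOSA
POLARIS_MINUTES = 35             # "under 35 minutes" for Polaris


def predictions_per_hour(per_second: int) -> int:
    """Convert a per-second prediction rate into a per-hour rate."""
    return per_second * SECONDS_PER_HOUR


def min_speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Ratio of search times; a lower bound when new_minutes is an upper bound."""
    return baseline_minutes / new_minutes


if __name__ == "__main__":
    print(f"{predictions_per_hour(PREDICTIONS_PER_SECOND):,} predictions per hour")
    print(f"at least {min_speedup(DOSA_MINUTES, POLARIS_MINUTES):.1f}x faster to match DOSA")
```

At 6,500 predictions per second this works out to 23.4 million predictions per hour, which is consistent with the "millions of configurations per hour" claim.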

Paper Structure

This paper contains 33 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Analytical models can be queried thousands of times an hour, but they are inaccurate, whereas an RTL simulator is accurate but slow. Our learned model, Starlight, breaks this tradeoff by predicting performance faster than an analytical model and with 99% accuracy when compared to an RTL simulator. These data were collected by Parashar et al. (2019), Karandikar et al. (2018), and Muñoz-Martínez et al. (2021).
  • Figure 2: A typical deep learning accelerator. Data is tiled in the scratchpad and fed into processing elements (PEs), which compute a convolution operation, using the accumulator to aggregate partial results.
  • Figure 3: A Gaussian process that models a ground truth function that has been sampled at 8 points. The acquisition function (in this case, Expected Improvement) is applied over the Gaussian process and maximized to determine the next sample to evaluate.
  • Figure 4: (A) Starlight-Low is a neural network that predicts the energy-delay product (EDP) of a DLA as measured by a low-fidelity method, namely, an analytical model. The encoder network (in blue dotted pattern) from Starlight-Low is transferred to (B) Starlight, which is a machine learning model based on deep kernel learning that predicts the EDP as measured by a high-fidelity method, namely, an RTL simulator. The decoder network is dropped because it is no longer needed.
  • Figure 5: The 2-D latent space of a VAE trained (a) without a predictor network and (b) simultaneously with a predictor network. Each point represents a HW/SW configuration that is color-coded by the EDP as measured by an analytical model. The predictor network induces structure, as indicated by the gradient of EDPs in the latent space.
  • ...and 8 more figures
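
The Figure 3 caption describes maximizing Expected Improvement over a Gaussian process to choose the next sample. A minimal sketch of that step follows, using the standard closed-form Expected Improvement for a Gaussian posterior. It is not the paper's implementation: the function names and candidate tuples are hypothetical, the posterior mean and standard deviation are assumed to come from some already-fitted surrogate, and the objective is assumed to be minimized (as EDP would be).

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean: float, std: float, best_so_far: float) -> float:
    """Closed-form EI at one candidate, assuming a minimization objective.

    mean and std are the surrogate's posterior at the candidate;
    best_so_far is the lowest objective value observed so far.
    """
    if std <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    return (best_so_far - mean) * normal_cdf(z) + std * normal_pdf(z)


def pick_next(candidates, best_so_far: float):
    """Return the (name, mean, std) candidate with the highest EI."""
    return max(
        candidates,
        key=lambda c: expected_improvement(c[1], c[2], best_so_far),
    )


if __name__ == "__main__":
    # Hypothetical posterior estimates for three candidate configurations.
    candidates = [("cfg_a", 2.0, 0.1), ("cfg_b", 2.0, 1.0), ("cfg_c", 3.0, 0.2)]
    print(pick_next(candidates, best_so_far=2.0)[0])
```

Two candidates with the same predicted mean are ranked by their uncertainty, which is how Expected Improvement trades off exploiting good predictions against exploring poorly understood regions.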