SCALE-Sim: Systolic CNN Accelerator Simulator

Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, Tushar Krishna

TL;DR

This work tackles the lack of accessible tools for designing and evaluating systolic-array CNN accelerators. It introduces SCALE-Sim, a public, cycle-accurate simulator that models compute, dataflow, memory, and system integration for configurable 2D systolic arrays and CNN workloads. Through MLPerf-based case studies, it reveals how dataflow choices, scratchpad sizing, array shape, and scaling strategy interact to determine end-to-end performance and energy, offering actionable design insights. The tool aims to speed up accelerator development by enabling rapid exploration of architectural trade-offs and their impact within larger system contexts.

Abstract

Systolic arrays are one of the most popular compute substrates within deep learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools that provide insights into both the design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce Systolic CNN Accelerator Simulator (SCALE-Sim), a configurable, systolic-array-based, cycle-accurate DNN accelerator simulator. SCALE-Sim exposes various micro-architectural features as well as system integration parameters to the designer to enable comprehensive design space exploration. To the best of our knowledge, this is the first systolic-array simulator tuned for running DNNs. Using SCALE-Sim, we conduct a suite of case studies and demonstrate the effect of bandwidth, dataflow, and aspect ratio on the overall runtime and energy of deep learning kernels across vision, speech, text, and games. We believe that these insights will be highly beneficial to architects and ML practitioners.
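To make the idea of running a dense matrix multiplication on a systolic array concrete, here is a minimal Python sketch of an output-stationary array. It is not SCALE-Sim's code or its cycle model: the function name `systolic_matmul_os` is hypothetical, the array is assumed to be exactly as large as the output matrix (no folding), and memory stalls and output draining are ignored.

```python
def systolic_matmul_os(a, b):
    """Toy output-stationary systolic array computing a @ b.

    a is R x K and b is K x C (lists of lists). PE (i, j) holds the partial
    sum for output (i, j). Row i of `a` enters from the left delayed by i
    cycles and column j of `b` enters from the top delayed by j cycles, so
    PE (i, j) sees the operand pair (a[i][k], b[k][j]) at cycle i + j + k.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]

    # The last multiply-accumulate happens in PE (R-1, C-1) for k = K-1,
    # i.e. at zero-based cycle R + C + K - 3, so R + C + K - 2 cycles total.
    total_cycles = rows + cols + inner - 2

    for t in range(total_cycles):
        for i in range(rows):
            for j in range(cols):
                k = t - i - j  # which operand pair reaches PE (i, j) now
                if 0 <= k < inner:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, total_cycles


if __name__ == "__main__":
    result, cycles = systolic_matmul_os([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, cycles)
```

Under these assumptions the compute latency grows as R + C + K - 2, which hints at why array shape and aspect ratio affect runtime; a real simulator additionally has to account for tiling, scratchpad capacity, and memory bandwidth.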

Paper Structure

This paper contains 20 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Schematic depicting SCALE-Sim, with inputs and outputs. The tool takes architecture parameters as a config file and workload hyper-parameters as a CSV file, and generates cycle-accurate traffic traces and simulation summary CSV files
  • Figure 2: Schematic showing the mapping in various dataflows: (a) output stationary; (b) weight stationary; (c) input stationary
  • Figure 3: Schematic showing the integration model of accelerator in a systems context
  • Figure 4: Cycles reported by SCALE-Sim runs and RTL simulation for Mat-Mat multiplication workloads
  • Figure 5: Chart showing the runtime in cycles to compute all the layers of our workloads while using different dataflows in square arrays with the dimensions (a) 128x128, (b) 64x64, (c) 32x32, (d) 16x16, and (e) 8x8
  • ...and 5 more figures