SCALE-Sim v3: A modular cycle-accurate systolic accelerator simulator for end-to-end system analysis

Ritik Raj; Sarbartha Banerjee; Nikhil Chandra; Zishen Wan; Jianming Tong; Ananda Samajdar; Tushar Krishna

TL;DR

SCALE-Sim v3 addresses the need for end-to-end, cycle-accurate analysis of modern AI accelerators by adding multi-core spatio-temporal partitioning, sparse matrix multiplication (SpMM) support, Ramulator-based DRAM modeling, data-layout-aware stall analysis, and Accelergy-based energy estimation. These modular components together report latency, bandwidth, and energy/power across diverse workloads and configurations. For ViT-base, a $128\times128$ array is $6.53\times$ faster than a $32\times32$ array when only latency is considered, while the $32\times32$ array is $2.86\times$ more energy-efficient; on energy-delay product (EdP), a $64\times64$ array outperforms both. Overall, SCALE-Sim v3 enables richer design-space exploration by providing full-system metrics that capture the interactions among memory, data layout, sparsity, and energy in modern accelerators.

Abstract

The rapid advancements in AI, scientific computing, and high-performance computing (HPC) have driven the need for versatile and efficient hardware accelerators. Existing tools like SCALE-Sim v2 provide valuable cycle-accurate simulations for systolic-array-based architectures but fall short in supporting key modern features such as sparsity, multi-core scalability, and comprehensive memory analysis. To address these limitations, we present SCALE-Sim v3, a modular, cycle-accurate simulator that extends the capabilities of its predecessor. SCALE-Sim v3 introduces five significant enhancements: multi-core simulation with spatio-temporal partitioning and hierarchical memory structures, support for sparse matrix multiplications (SpMM) with layer-wise and row-wise sparsity, integration with Ramulator for detailed DRAM analysis, precise data layout modeling to minimize memory stalls, and energy and power estimation via Accelergy. These improvements enable deeper end-to-end system analysis for modern AI accelerators, accommodating a wide variety of systems and workloads and providing detailed full-system insights into latency, bandwidth, and power efficiency. For example, when latency is the only metric, a $128\times128$ array is $6.53\times$ faster than a $32\times32$ array for ViT-base. However, SCALE-Sim v3 finds that the $32\times32$ array is $2.86\times$ more energy-efficient due to better utilization and lower leakage energy. On energy-delay product (EdP), a $64\times64$ array outperforms both the $128\times128$ and $32\times32$ arrays for ViT-base. Similarly, SCALE-Sim v2 shows a 21% reduction in compute cycles for six ResNet18 layers using weight-stationary (WS) dataflow compared to output-stationary (OS). However, once DRAM stalls are factored in, OS dataflow exhibits 30.1% lower execution cycles than WS, highlighting the critical role of detailed DRAM analysis.
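The latency and energy figures quoted above can be combined into a rough energy-delay comparison. The sketch below is illustrative only and is not part of SCALE-Sim v3: it assumes that "more energy-efficient" means a ratio of total energy for the same ViT-base workload, and it normalizes the $32\times32$ energy and the $128\times128$ latency to 1. The function and constant names are hypothetical.

```python
# Relative figures quoted in the abstract for ViT-base.
SPEEDUP_128_OVER_32 = 6.53      # 128x128 is 6.53x faster than 32x32
ENERGY_GAIN_32_OVER_128 = 2.86  # 32x32 uses 2.86x less energy (assumed: total-energy ratio)


def relative_edp(energy: float, delay: float) -> float:
    """Energy-delay product for normalized (unitless) energy and delay."""
    return energy * delay


# Normalization (an assumption of this sketch):
#   32x32:   energy = 1.0,  delay = 6.53
#   128x128: energy = 2.86, delay = 1.0
edp_32 = relative_edp(energy=1.0, delay=SPEEDUP_128_OVER_32)
edp_128 = relative_edp(energy=ENERGY_GAIN_32_OVER_128, delay=1.0)

# Ratio > 1 means the 128x128 array has the lower (better) EdP of the two.
ratio = edp_32 / edp_128

print(f"relative EdP, 32x32:   {edp_32:.2f}")
print(f"relative EdP, 128x128: {edp_128:.2f}")
print(f"EdP(32x32) / EdP(128x128): {ratio:.2f}")
```

Under these assumptions the $128\times128$ array has a lower EdP than the $32\times32$ array by a factor of about 2.3. The reported result that $64\times64$ beats both cannot be derived from these two ratios alone; it depends on the $64\times64$ latency and energy measured by the simulator.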
