SCALE-Sim TPU: Validating and Extending SCALE-Sim for TPUs

Jingtian Dang, Ritik Raj, Changhai Man, Jianming Tong, Tushar Krishna

Abstract

Cycle-accurate simulators are widely used to study systolic accelerators, yet their accuracy and usability are often limited by weak validation against real hardware and poor integration with modern ML compiler stacks. This paper presents SCALE-Sim TPU, a validated and extended version of SCALE-Sim v3 for TPU-style accelerators. Specifically, we make three contributions: (1) We validate SCALE-Sim's systolic GEMM model against measurements on Google TPU v4 and show that simulated cycle counts exhibit a strong linear correlation with hardware latency, enabling a simple cycle-to-latency mapping. (2) We introduce lightweight learned latency models for non-systolic elementwise operations, achieving median relative errors below 3 percent using only tensor size and shape, substantially improving end-to-end latency estimation. (3) We integrate a StableHLO-based frontend that allows workloads from modern ML frameworks such as JAX and PyTorch to be simulated directly via a unified compiler IR. Together, these contributions improve the fidelity, coverage, and practicality of cycle-accurate simulation for whole-model performance analysis on TPUs.
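The cycle-to-latency mapping in contribution (1) can be illustrated with an ordinary least-squares fit. The following is a minimal sketch, not the authors' implementation: the function names `fit_line` and `fit_metrics` are hypothetical, and the data points are made up so that the fit is exact. It also computes the summary statistics commonly reported for such fits ($R^2$, RMSE, MAE, MAPE).

```python
from math import sqrt

def fit_line(cycles, latency_us):
    """Ordinary least squares for latency ~= slope * cycles + intercept."""
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(latency_us) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, latency_us))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def fit_metrics(actual, predicted):
    """R^2, RMSE, MAE, and MAPE (in percent) of predictions against measurements."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_a = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape_pct": 100.0 * sum(abs(e) / a for e, a in zip(errors, actual)) / n,
    }

# Made-up (simulated cycles, measured microseconds) pairs for illustration only;
# these are NOT measurements from the paper.
cycles = [1000, 2000, 4000, 8000]
latency_us = [3.0, 5.0, 9.0, 17.0]  # constructed as 0.002 * cycles + 1.0

slope, intercept = fit_line(cycles, latency_us)
predicted = [slope * c + intercept for c in cycles]
metrics = fit_metrics(latency_us, predicted)
print(slope, intercept, metrics)
```

On real measurements the fit would not be exact, and the residual statistics would indicate how far a single linear mapping can be trusted across GEMM sizes.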

Paper Structure (37 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of the SCALE-Sim TPU Workflow and Contributions. ML programs written in JAX or PyTorch are compiled into StableHLO, which serves as a unified input interface. The StableHLO parser extracts operator metadata and routes systolic-array operations (e.g., GEMM and convolution) to SCALE-Sim v3, while non-systolic operations are handled by lightweight analytic or learned latency models. Circles denote data artifacts or files, and rectangles denote processing components. The green region represents the complete SCALE-Sim TPU toolchain. The blue dashed region highlights legacy components inherited from prior work (SCALE-Sim v3), while the orange dashed region highlights the components introduced or extended in this paper.
  • Figure 2: SCALE-Sim-to-TPU v4 regression for systolic GEMM across three size regimes. Each plot shows SCALE-Sim predicted cycle counts (x-axis) versus measured TPU v4 kernel latency (y-axis) for GEMM workloads executed on a $128\times128$ systolic array. Each point corresponds to one GEMM shape from the sweep, and the solid line shows a least-squares linear regression; the inset reports $R^2$, RMSE, MAE, and sample count. Across all regimes, SCALE-Sim cycle counts show a clear linear relationship with measured TPU execution time, indicating that simulated cycles provide a useful predictor of systolic GEMM latency.
  • Figure 3: bf16 elementwise-add latency versus tensor size for 1D sweeps (32 to 8192, step 32) and 2D sweeps (64 to 1024 per dimension, step 64). Latency scales near-linearly with tensor size, with minor shape-dependent fluctuations.
  • Figure 4: Predicted vs. actual GEMM latency on TPU v4. Each point represents a GEMM configuration, grouped by workload size (small, medium, large). The dashed line indicates perfect prediction ($y=x$). While SCALE-Sim TPU preserves the overall scaling trend across sizes ($R^2=0.893$), deviations are more pronounced for medium-sized workloads, leading to higher aggregate error (MAPE = 32.2%).
  • Figure 5: Learned latency model evaluation for non-systolic (elementwise) operations. Estimated versus measured TPU latency for (top) elementwise addition and (bottom) ReLU (maximum) across a diverse set of tensor shapes. Each point is one tensor shape; the dashed diagonal indicates perfect prediction.
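
The routing step described in Figure 1 can be sketched as a simple dispatch on operator name. This is an illustrative sketch under stated assumptions, not the toolchain's actual parser: `route_ops`, the backend labels `systolic_simulator` and `latency_model`, and the choice of which StableHLO ops count as systolic are all hypothetical. The op names themselves (`stablehlo.dot_general`, `stablehlo.convolution`, `stablehlo.add`, `stablehlo.maximum`) are standard StableHLO operations.

```python
from collections import defaultdict

# Assumption: only these StableHLO ops are treated as systolic-array work
# (GEMM and convolution, following the Figure 1 caption).
SYSTOLIC_OPS = {"stablehlo.dot_general", "stablehlo.convolution"}

def route_ops(op_names):
    """Group operator names by the (hypothetical) backend that would model them."""
    routed = defaultdict(list)
    for name in op_names:
        backend = "systolic_simulator" if name in SYSTOLIC_OPS else "latency_model"
        routed[backend].append(name)
    return dict(routed)

# Example operator sequence, e.g. a matmul, a bias add, a ReLU (maximum), and a convolution.
ops = [
    "stablehlo.dot_general",
    "stablehlo.add",
    "stablehlo.maximum",
    "stablehlo.convolution",
]
routes = route_ops(ops)
print(routes)
```

A real frontend would also carry each operator's shapes and dtypes alongside its name, since both the systolic simulation and the elementwise latency models depend on tensor dimensions.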