ONNXim: A Fast, Cycle-level Multi-core NPU Simulator

Hyungkyu Ham; Wonhyuk Yang; Yunseon Shin; Okkyun Woo; Guseul Heo; Sangyeop Lee; Jongse Park; Gwangsun Kim

TL;DR

ONNXim targets fast simulation of large, multi-core NPUs under multi-tenant DNN workloads by combining a fast cycle-level core model with cycle-accurate DRAM and NoC simulators. It ingests models in the standard ONNX graph format, uses ONNX Runtime optimizations, lowers the graph to tile-level operations, and executes them on a double-buffered, weight-stationary systolic array core whose compute latency is deterministic. ONNXim is up to $384\times$ faster than prior cycle-level simulators while remaining accurate (MAE $= 0.23\%$, $r = 0.99$ against Gemmini RTL), which makes case studies on multi-tenant workloads and LLM attention mechanisms practical. The result is rapid architectural exploration for server-class NPUs and hardware-aware design of NPU accelerators for DNN serving. The simulator is open source at https://github.com/PSAL-POSTECH/ONNXim.
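The two accuracy figures quoted above compare simulated cycle counts against reference RTL cycle counts. The sketch below shows one common way to compute such metrics. It assumes MAE means the mean absolute error relative to the reference, expressed as a percentage, and that $r$ is the Pearson correlation coefficient; the source does not spell out either definition. The cycle counts are made-up illustration values, not results from the paper.

```python
import math

def mean_abs_pct_error(sim, ref):
    """Mean of |sim - ref| / ref over all samples, as a percentage (assumed definition)."""
    return 100.0 * sum(abs(s - r) / r for s, r in zip(sim, ref)) / len(ref)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cycle counts for four operations (illustration only).
rtl_cycles = [1000, 2000, 4000, 8000]
sim_cycles = [1010, 1990, 4020, 7960]

mae_pct = mean_abs_pct_error(sim_cycles, rtl_cycles)
r = pearson_r(sim_cycles, rtl_cycles)
```

A low percentage error together with a correlation near 1 indicates that the simulator tracks both the magnitude and the trend of the reference cycle counts across operation sizes.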

Abstract

As DNNs are widely adopted in various application domains while demanding ever more compute and memory, designing efficient and performant NPUs (Neural Processing Units) is becoming more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes DNN models represented in the ONNX graph format, which can be generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from on-chip scratchpad memory with deterministic compute latency, we forgo detailed modeling of the computation while still preserving simulation accuracy. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at the cycle level to properly capture contention among multiple cores, which can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to $384\times$ over Accel-Sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow simulation speed and/or missing functionality. ONNXim is publicly available at https://github.com/PSAL-POSTECH/ONNXim.
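The core-model idea in the abstract (deterministic per-tile compute latency, with dependencies between compute and tile DMAs preserved) can be illustrated with a small timing sketch. This is not ONNXim's implementation. The function names, the closed-form latency `rows + 2 * array_dim - 1`, and the fixed per-tile DMA latency are all assumptions made for illustration; in ONNXim the DMA latency comes from the cycle-level DRAM and NoC models rather than from a constant.

```python
def tile_compute_cycles(rows, array_dim):
    """Hypothetical deterministic latency for one tile on a square systolic array:
    stream `rows` input rows through the array, plus fill/drain skew."""
    return rows + 2 * array_dim - 1

def schedule_tiles(tiles, array_dim, num_buffers=2):
    """Return the total cycles to process `tiles`, a list of (rows, dma_cycles).

    Dependencies modeled:
      - a tile's compute starts only after its DMA load has finished,
      - computes run one at a time, in order,
      - DMA loads run one at a time, in order,
      - a load needs a free scratchpad buffer, i.e. the tile that used the
        buffer `num_buffers` tiles earlier must have finished computing.
    """
    load_done = []
    compute_done = []
    for i, (rows, dma_cycles) in enumerate(tiles):
        start_load = load_done[i - 1] if i > 0 else 0
        if i >= num_buffers:
            start_load = max(start_load, compute_done[i - num_buffers])
        load_done.append(start_load + dma_cycles)

        prev_compute = compute_done[i - 1] if i > 0 else 0
        start_compute = max(load_done[i], prev_compute)
        compute_done.append(start_compute + tile_compute_cycles(rows, array_dim))
    return compute_done[-1] if compute_done else 0

# Three identical tiles: 8 rows each, 10-cycle DMA, on an assumed 8x8 array.
tiles = [(8, 10)] * 3
double_buffered = schedule_tiles(tiles, array_dim=8, num_buffers=2)
single_buffered = schedule_tiles(tiles, array_dim=8, num_buffers=1)
```

With two buffers the load of the next tile overlaps the compute of the current one, so the total is lower than in the single-buffer case, where every load waits for the previous compute to finish. Because the compute latency is a closed-form function of the tile shape, the simulator only has to advance time at DMA and compute boundaries instead of stepping every processing element each cycle.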

Paper Structure (11 sections, 5 figures, 2 tables)

Introduction
ONNXim
Front End
NPU Microarchitecture Model
Evaluation
Methodology
Simulation Speed
Validation with Core RTL Model
Case Study on a Multi-tenant Workload
Case Study on the Impact of Attention Mechanism
Conclusion

Figures (5)

Figure 1: Overview of ONNXim simulator.
Figure 2: Comparison of simulation speed over Accel-Sim for GEMM. (X-axis: size of each dimension $N$ for $N\times N\times N$ GEMMs.)
Figure 3: (a) End-to-end simulation speedup over Accel-Sim for different batch sizes ("B"). (b) Cycle count comparison between ONNXim and Gemmini RTL model for CONV and GEMM operations for an $8\times 8$ systolic array.
Figure 4: Distribution of TBT for GPT-3's generation phase when run with ResNet-50 on the same multi-core NPU with different batch sizes.
Figure 5: Impact of different attention mechanisms on resource utilization.
