ONNXim: A Fast, Cycle-level Multi-core NPU Simulator

Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, Gwangsun Kim

TL;DR

ONNXim tackles the challenge of rapidly simulating large, multi-core NPUs under multi-tenant DNN workloads by combining a fast cycle-level core model with cycle-accurate DRAM and NoC simulators. It ingests models in the standard ONNX graph format, leverages ONNX Runtime optimizations to lower graphs to tile-level operations, and executes them on a double-buffered, weight-stationary systolic array core with a deterministic compute path. The approach achieves speedups of up to $384\times$ over prior cycle-level simulators while maintaining high accuracy ($\mathrm{MAE} = 0.23\%$, $r = 0.99$ vs. Gemmini RTL), which makes case studies on multi-tenant workloads and LLM attention mechanisms practical. It thereby supports rapid architectural exploration and hardware-aware design of server-class NPUs for DNN serving, and it is available as open source at the project repository.
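The two accuracy figures above (mean absolute error and correlation against an RTL reference) can be illustrated with a minimal sketch. It assumes that the error is the mean of per-workload relative cycle-count differences and that $r$ is the Pearson correlation coefficient; the summary does not spell out the exact definitions, and the cycle counts below are made-up placeholders, not measurements from the paper.

```python
import math

def mean_absolute_pct_error(reference, simulated):
    """Mean of |simulated - reference| / reference, expressed in percent."""
    errors = [abs(s - r) / r for r, s in zip(reference, simulated)]
    return 100.0 * sum(errors) / len(errors)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cycle counts (illustrative only, not taken from the paper).
rtl_cycles = [1000, 2000, 4000]
sim_cycles = [1010, 1980, 4000]

print(f"MAE = {mean_absolute_pct_error(rtl_cycles, sim_cycles):.2f}%")
print(f"r   = {pearson_r(rtl_cycles, sim_cycles):.4f}")
```

A low error together with a correlation near 1 indicates that the simulator tracks both the absolute cycle counts and their trend across workloads.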

Abstract

As DNNs are widely adopted in various application domains while demanding increasingly higher compute and memory requirements, designing efficient and performant NPUs (Neural Processing Units) is becoming more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes DNN models represented in the ONNX graph format, which can be generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from on-chip scratchpad memory with deterministic compute latency, we forgo detailed modeling of the computation while still preserving simulation accuracy. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at cycle level to properly capture contention among multiple cores that can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to $384\times$ over Accel-Sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow simulation speed and/or missing functionality. ONNXim is publicly available at https://github.com/PSAL-POSTECH/ONNXim.

Paper Structure

This paper contains 11 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of ONNXim simulator.
  • Figure 2: Comparison of simulation speed over Accel-Sim for GEMM. (X-axis: size of each dimension $N$ for $N\times N\times N$ GEMMs.)
  • Figure 3: (a) End-to-end simulation speedup over Accel-Sim for different batch sizes ("B"). (b) Cycle count comparison between ONNXim and the Gemmini RTL model for CONV and GEMM operations on an $8\times8$ systolic array.
  • Figure 4: Distribution of TBT (time between tokens) for GPT-3's generation phase when it runs alongside ResNet-50 on the same multi-core NPU with different batch sizes.
  • Figure 5: Impact of different attention mechanisms on resource utilization.