Parendi: Thousand-Way Parallel RTL Simulation
Mahyar Emami, Thomas Bourgeat, James Larus
TL;DR
Parendi tackles the inefficiency of cycle-accurate RTL simulation by exploiting fine-grained parallelism on the Graphcore IPU via a BSP-based RTL compiler. The work analyzes synchronization, communication, and computation costs, and introduces a data-dependence graph partitioning approach with submodular load balancing to map RTL fibers to thousands of IPU tiles. It demonstrates up to 4× speedups over Verilator on large designs, with substantial cost and memory advantages when deployed on IPU-based clouds, and discusses limitations and strategies for scaling to more IPUs. The contributions include the first open-source thousand-way RTL simulator, a customized partitioning compiler, and a comprehensive evaluation across IPU and x64 platforms that informs future parallel RTL simulation on massively parallel architectures.
Abstract
Hardware development critically depends on cycle-accurate RTL simulation. However, as chip complexity increases, conventional single-threaded simulation becomes impractical due to stagnant single-core performance. Parendi is an RTL simulator that addresses this challenge by exploiting the abundant fine-grained parallelism inherent in RTL simulation and efficiently mapping it onto the massively parallel Graphcore IPU (Intelligence Processing Unit) architecture. Parendi scales up to 5888 cores on 4 Graphcore IPU sockets. It allows us to run large RTL designs up to 4× faster than the most powerful state-of-the-art x64 multicore systems. To achieve this performance, we developed new partitioning and compilation techniques and carefully quantified the synchronization, communication, and computation costs of parallel RTL simulation. The paper comprehensively analyzes these factors and details the strategies that Parendi uses to optimize them.
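To make the bulk-synchronous parallel (BSP) execution model concrete, the following is a minimal sketch, not Parendi's implementation. It assumes that a "fiber" is the logic computing one register's next value from the current-cycle register values; the names `bsp_step`, `simulate`, and the toy two-register design are hypothetical and chosen only for illustration. Each simulated clock cycle is one superstep: all fibers compute independently from the old state, and the new values become visible only after a barrier.

```python
from typing import Callable, Dict

State = Dict[str, int]
# Assumption: a fiber maps the current register state to the next value
# of exactly one register.
Fiber = Callable[[State], int]


def bsp_step(state: State, fibers: Dict[str, Fiber]) -> State:
    """Simulate one clock cycle as a single BSP superstep."""
    # Compute phase: every fiber reads only the old state, so the fibers
    # are independent and could run on separate cores or tiles.
    next_values = {reg: fiber(state) for reg, fiber in fibers.items()}
    # Barrier and communication phase: publish all new values at once.
    return {**state, **next_values}


def simulate(state: State, fibers: Dict[str, Fiber], cycles: int) -> State:
    """Run the design for the given number of clock cycles."""
    for _ in range(cycles):
        state = bsp_step(state, fibers)
    return state


# Hypothetical toy design: a 4-bit counter and a register that samples
# the counter's value from the previous cycle.
fibers: Dict[str, Fiber] = {
    "count": lambda s: (s["count"] + 1) & 0xF,
    "prev": lambda s: s["count"],
}

final = simulate({"count": 0, "prev": 0}, fibers, cycles=5)
print(final)  # {'count': 5, 'prev': 4}
```

In this sketch the dictionary comprehension runs the fibers sequentially; in a parallel simulator the compute phase is what gets distributed, and the cost of the barrier and the value exchange is what the synchronization and communication analysis has to account for.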
