Parendi: Thousand-Way Parallel RTL Simulation

Mahyar Emami, Thomas Bourgeat, James Larus

TL;DR

Parendi tackles the inefficiency of cycle-accurate RTL simulation by exploiting fine-grained parallelism on the Graphcore IPU via a BSP-based RTL compiler. The work analyzes synchronization, communication, and computation costs, and introduces a data-dependence graph partitioning approach with submodular load balancing to map RTL fibers to thousands of IPU tiles. It demonstrates up to 4× speedups over Verilator on large designs, with substantial cost and memory advantages when deployed on IPU-based clouds, and discusses limitations and strategies for scaling to more IPUs. The contributions include the first open-source thousand-way RTL simulator, a customized partitioning compiler, and a comprehensive evaluation across IPU and x64 platforms that informs future parallel RTL simulation on massively parallel architectures.

Abstract

Hardware development critically depends on cycle-accurate RTL simulation. However, as chip complexity increases, conventional single-threaded simulation becomes impractical due to stagnant single-core performance. Parendi is an RTL simulator that addresses this challenge by exploiting the abundant fine-grained parallelism inherent in RTL simulation and efficiently mapping it onto the massively parallel Graphcore IPU (Intelligence Processing Unit) architecture. Parendi scales up to 5888 cores on 4 Graphcore IPU sockets. It allows us to run large RTL designs up to 4× faster than the most powerful state-of-the-art x64 multicore systems. To achieve this performance, we developed new partitioning and compilation techniques and carefully quantified the synchronization, communication, and computation costs of parallel RTL simulation: The paper comprehensively analyzes these factors and details the strategies that Parendi uses to optimize them.

Paper Structure (27 sections, 1 equation, 17 figures, 3 tables)


Figures (17)

  • Figure 1: Chip growth and single-thread performance [processor_scaling_data]. The dashed line predicts the core count, assuming linear scaling, necessary to simulate a state-of-the-art chip at the same rate as in 2006.
  • Figure 2: The IPU processor and M2000 server blade.
  • Figure 3: BSP Simulation of an RTL data dependence graph. The graph contains three fibers (f1, f2, f3), partitioned into two processes (p1, p2), running on two threads. a3 is duplicated. The run on the right shows the computation and communication phases, separated by barriers.
  • Figure 4: IPU and x64 PRNG rates
  • Figure 5: Measured communication cycles on the IPU
  • ...and 12 more figures
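
The BSP execution pattern described in the Figure 3 caption (fibers grouped into processes, with computation and communication phases separated by barriers) can be illustrated with a small sketch. This is not Parendi's implementation: the three-register design, the `FIBERS` and `PARTITION` tables, and the function names are hypothetical, the two processes are emulated sequentially rather than on real threads or IPU tiles, and every update is broadcast to every process instead of only to its consumers. Duplication of shared logic (a3 in the figure) is not modeled.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Fiber = Callable[[State], int]

# Hypothetical design: each fiber computes the next value of one register
# from the current (previous-cycle) register values.
FIBERS: Dict[str, Fiber] = {
    "r1": lambda s: (s["r1"] + 1) & 0xFF,
    "r2": lambda s: (s["r1"] + s["r2"]) & 0xFF,
    "r3": lambda s: (s["r2"] ^ s["r3"]) & 0xFF,
}

# Assumed partition: process 0 owns the fibers for r1 and r2, process 1 owns r3.
PARTITION: List[List[str]] = [["r1", "r2"], ["r3"]]


def bsp_simulate(init: State, cycles: int) -> State:
    """Emulate BSP supersteps: compute, barrier, communicate, barrier."""
    # Each process keeps a private copy of the register state.
    local = [dict(init) for _ in PARTITION]
    for _ in range(cycles):
        # Computation phase: every process evaluates only its own fibers,
        # reading only its private copy of the previous-cycle state.
        updates = [
            {reg: FIBERS[reg](local[p]) for reg in regs}
            for p, regs in enumerate(PARTITION)
        ]
        # (barrier) All fibers have finished before any value is exchanged.
        # Communication phase: publish the new register values to all copies.
        for upd in updates:
            for copy in local:
                copy.update(upd)
        # (barrier) All copies agree before the next cycle starts.
    return local[0]


def reference_simulate(init: State, cycles: int) -> State:
    """Single-process reference: update all registers from one shared state."""
    state = dict(init)
    for _ in range(cycles):
        state = {reg: fiber(state) for reg, fiber in FIBERS.items()}
    return state
```

Because each fiber reads only previous-cycle values during the computation phase, the partitioned run produces the same register values as the single-process reference. The barriers are what make that guarantee hold once the processes actually run in parallel.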