Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism

Mahyar Emami, Sahand Kashani, Keisuke Kamahori, Mohammad Sepehr Pourghannad, Ritik Raj, James R. Larus

TL;DR

Manticore tackles the persistent bottleneck of cycle-accurate RTL simulation with a hardware accelerator built on a static bulk-synchronous parallel (BSP) model. Its compiler statically schedules computation and inter-core communication across hundreds of simple cores connected by a deterministic network-on-chip (NoC), enabling fine-grain parallelism without runtime synchronization costs. The authors demonstrate a 225-core FPGA prototype that outperforms a state-of-the-art software RTL simulator in 8 out of 9 benchmarks, and they show further gains from compiler optimizations such as communication-aware partitioning and custom function synthesis. The prototype still has limitations (single-clock RTL only, no support for timing control, and long compile times), but it shows that deterministic, BSP-style execution can substantially speed up RTL verification and design exploration, potentially improving designer productivity and time-to-result.

Abstract

The demise of Moore's Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1–1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers' productivity by speeding design iterations and permitting more exhaustive exploration. One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, state-of-the-art RTL simulators often perform best on a single core since modern processors cannot effectively exploit fine-grain parallelism. This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate fine-grain synchronization overhead. It relies entirely on a compiler to schedule resources and communication, which is feasible since RTL code contains few divergent execution paths. With static scheduling, communication and synchronization no longer incur runtime overhead, making fine-grain parallelism practical. Moreover, static scheduling dramatically simplifies processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on desktop and server computers in 8 out of 9 benchmarks.
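
To make the static BSP idea concrete, here is a minimal Python sketch of one way such an execution model could be simulated in software. It is not the authors' compiler or hardware: the `Core` class, the two-register example design (`cnt`, `acc`), the per-core instruction counts, and the communication latency are all hypothetical values chosen for illustration. Each simulated RTL cycle has a compute phase that reads only local state, followed by a communication phase whose sends are fixed ahead of time, so no locks or barriers are needed at run time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]

@dataclass
class Core:
    state: State                        # local copies of every register this core reads
    compute: Callable[[State], State]   # next values of the registers this core owns
    sends: List[Tuple[str, int]]        # static schedule: (register, destination core index)
    work: int                           # assumed instruction count of the compute phase

def padded_step_length(cores: List[Core], comm_latency: int) -> int:
    """Machine cycles per simulated RTL cycle, known before execution starts."""
    return max(c.work for c in cores) + comm_latency

def nop_padding(cores: List[Core], comm_latency: int) -> List[int]:
    """Idle slots each core needs so that all cores finish the step together."""
    total = padded_step_length(cores, comm_latency)
    return [total - c.work for c in cores]

def bsp_step(cores: List[Core]) -> None:
    # Compute phase: every core evaluates its partition from local state only.
    updates = [c.compute(c.state) for c in cores]
    # Communication phase: commit owned registers, then deliver values
    # according to the fixed send lists (no runtime synchronization).
    for core, upd in zip(cores, updates):
        core.state.update(upd)
    for core, upd in zip(cores, updates):
        for reg, dst in core.sends:
            cores[dst].state[reg] = upd[reg]

def simulate(cores: List[Core], steps: int) -> None:
    for _ in range(steps):
        bsp_step(cores)

def build_example() -> List[Core]:
    # Hypothetical design: core 0 owns an 8-bit counter, core 1 owns a
    # 16-bit accumulator that adds the counter's current value each cycle.
    core0 = Core({"cnt": 0},
                 lambda s: {"cnt": (s["cnt"] + 1) & 0xFF},
                 sends=[("cnt", 1)], work=1)
    core1 = Core({"cnt": 0, "acc": 0},
                 lambda s: {"acc": (s["acc"] + s["cnt"]) & 0xFFFF},
                 sends=[], work=2)
    return [core0, core1]

if __name__ == "__main__":
    cores = build_example()
    simulate(cores, 5)
    print(cores[0].state, cores[1].state)
    print("NOP padding per core:", nop_padding(cores, comm_latency=3))
```

In this sketch the accumulator always sees the counter value from the previous step, which mirrors register semantics in a single-clock design. The padding computation shows why a fixed schedule can replace a barrier: because every core's step length is known in advance, shorter partitions simply idle until the longest one and its messages are done.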

Paper Structure (52 sections, 16 figures, 8 tables)

Figures (16)

  • Figure 1: An example single-clock netlist (top) and its DAG representation (bottom). Circles represent gates and rectangles represent registers.
  • Figure 2: The static BSP execution model. Each core performs a local computation and then sends its result to the cores that need it for the next computation phase. Cores wait (with compiler-inserted NOps) until all communication completes before starting new computation.
  • Figure 3: A Manticore grid of processors on a uni-directional 2D torus NoC. The cores and the NoC reside in the compute clock domain, while all other components reside in the control clock domain. The privileged core is connected to a cache and can access off-chip DRAM.
  • Figure 4: Manticore compiler. Frontend in red, backend in green. A host communicates with the Manticore accelerator through a runtime shown in blue.
  • Figure 5: Measured simulated model speed on a desktop (left) and server (right). Dashed lines model only synchronization cost (model 1). Solid lines also include i-cache pressure (model 2). Each curve is labeled by the number of instructions in a simulation step. The table shows the maximum speedup of each model.
  • ...and 11 more figures
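
The uni-directional 2D torus mentioned in the Figure 3 caption can be illustrated with a small hop-count calculation. The sketch below rests on assumptions that the captions do not state: packets travel along X first and then Y, each link costs one hop, and the 225 cores are arranged as a 15 × 15 grid. The function names are hypothetical. Because links run in one direction only, a message to a "previous" neighbour must wrap around the ring, and every distance can be computed ahead of time from coordinates alone.

```python
from typing import Tuple

Coord = Tuple[int, int]

def torus_hops(src: Coord, dst: Coord, width: int, height: int) -> int:
    """Hops from src to dst when links only go in the +X and +Y directions."""
    (sx, sy), (dx, dy) = src, dst
    # Modular distance: moving "backwards" means going all the way around.
    return (dx - sx) % width + (dy - sy) % height

def worst_case_hops(width: int, height: int) -> int:
    """Longest possible path: almost a full lap in each dimension."""
    return (width - 1) + (height - 1)

if __name__ == "__main__":
    W = H = 15  # assumed layout for 225 cores
    print(torus_hops((0, 0), (1, 0), W, H))   # next neighbour in +X
    print(torus_hops((1, 0), (0, 0), W, H))   # previous neighbour wraps around
    print(worst_case_hops(W, H))
```

Under these assumptions the path length depends only on the source and destination coordinates, which is the kind of fixed, predictable quantity a static scheduler needs when it plans communication.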