parti-gem5: gem5's Timing Mode Parallelised

José Cubero-Cascante, Niko Zurstraßen, Jörn Nöller, Rainer Leupers, Jan Moritz Joseph

TL;DR

This paper introduces parti-gem5, an extension of gem5 that parallelises timing-mode simulation on multi-core hosts. It extends prior work by supporting detailed timing models (MinorCPU, O3CPU) and the Ruby cache/interconnect. It presents thread-safe strategies for Ruby messaging and non-coherent traffic across time-domain partitions, achieving speedups of up to 42.7x on a simulated 120-core ARM MPSoC with total simulated-time errors typically below 15%. The evaluation across synthetic, PARSEC, and STREAM workloads shows that speedups depend on workload characteristics such as data sharing, while cache-miss-rate errors remain low (under ~2.5%), indicating that parallelisation preserves practical fidelity. The results highlight parti-gem5's potential for scalable microarchitectural exploration and point to future work on formal verification and broader architectural coverage.

Abstract

Detailed timing models are indispensable tools for the design space exploration of Multiprocessor Systems on Chip (MPSoCs). As core counts continue to increase, memory hierarchies and interconnect topologies are also growing in complexity, making it harder than ever to predict the impact of design decisions accurately. In this context, the open-source Full System Simulator (FSS) gem5 is a popular choice for MPSoC design space exploration, thanks to its flexibility and robust set of detailed timing models. However, its single-threaded simulation kernel severely limits its throughput. To address this challenge, we introduce parti-gem5, an extension of gem5 that enables parallel timing simulations on modern multi-core simulation hosts. Unlike previous work, parti-gem5 supports gem5's timing mode, the O3CPU, and Ruby's custom cache and interconnect models. Compared to reference single-threaded simulations, we achieved speedups of up to 42.7x when simulating a 120-core ARM MPSoC on a 64-core x86-64 host system. While our method introduces timing deviations, the error in total simulated time is below 15% in most cases.

Paper Structure (17 sections, 9 figures, 3 tables)

This paper contains 17 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Flow diagram of gem5's DES and PDES. Adapted from pargem5.
  • Figure 2: Communication between a requester and a responder using the Atomic Mode (a) and the Timing Mode (b).
  • Figure 3: Ruby message passing.
  • Figure 4: Topology and event queue assignment of an exemplary Ruby system. For the sake of simplicity, some components, such as message buffers and TLBs, are not depicted.
  • Figure 5: Ruby parallelisation challenges and solutions. (a) Multiple senders S0 and S1 communicate with a single consumer C0. (b,c) Bi-directional message passing between two routers R0 and R1. The circular wait in (b) is eliminated in (c) by introducing the Throttle objects T0 and T1.
  • ...and 4 more figures