Parallelizing a modern GPU simulator

Rodrigo Huerta, Antonio González

TL;DR

This work tackles the heavy cost of simulating modern GPU architectures by introducing a minimal OpenMP-based parallelization of the Accel-sim framework. The parallelization is deterministic: multi-threaded runs produce the same results as the single-threaded version. Across workloads, it reports an average speed-up of about 5.8x with 16 threads (up to 14x in some cases), obtained by parallelizing SM-level execution and keeping statistics per SM to avoid data races. The study also analyzes the impact of OpenMP scheduling and shows that whether static or dynamic scheduling performs better depends on the workload. The approach lets researchers simulate larger GPUs and workloads and obtain results sooner, improving productivity and modeling fidelity without sacrificing accuracy.

Abstract

Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This paper demonstrates that simulating some GPGPU workloads in a single-threaded state-of-the-art simulator such as Accel-sim can take more than five days. In this paper, we present a simple approach to parallelize this simulator with minimal code changes by using OpenMP. Moreover, our parallelization technique is deterministic, so the simulator provides the same results for single-threaded and multi-threaded simulations. Compared to previous work, we achieve a higher speed-up and, more importantly, the parallel simulation does not incur any inaccuracies. When we run the simulator with 16 threads, we achieve an average speed-up of 5.8x and reach 14x in some workloads. This allows researchers to simulate applications that take five days in less than 12 hours. By speeding up simulations, researchers can model larger systems, simulate bigger workloads, add more detail to the model, increase the efficiency of the hardware platform where the simulator is run, and obtain results sooner.
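
To make the idea above concrete, the following is a minimal sketch of a per-SM parallel loop with per-SM statistics, written in C++ with an OpenMP pragma. It is not Accel-sim code: `ToySM`, `cycle`, `issued`, and `simulate` are hypothetical names invented for this illustration, and the per-cycle "work" is a toy pseudo-random update. The sketch assumes that each SM can be advanced by one cycle while touching only its own state, which is the property that makes the parallel loop free of data races.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one streaming multiprocessor (SM).
// This is an illustration only, not a class from Accel-sim.
struct ToySM {
    std::uint64_t state = 0;   // private per-SM state
    std::uint64_t issued = 0;  // per-SM statistic, written only for this SM

    // Advance this SM by one cycle using a simple linear congruential update.
    void cycle() {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        issued += (state >> 62);  // between 0 and 3 "instructions" per cycle
    }
};

// Advance every SM for num_cycles cycles and return the total issue count.
std::uint64_t simulate(int num_sms, int num_cycles) {
    assert(num_sms > 0 && num_cycles >= 0);
    std::vector<ToySM> sms(static_cast<std::size_t>(num_sms));
    for (int i = 0; i < num_sms; ++i) {
        sms[static_cast<std::size_t>(i)].state = static_cast<std::uint64_t>(i) + 1;
    }

    for (int c = 0; c < num_cycles; ++c) {
        // Iteration i reads and writes only sms[i], so threads never share
        // a counter. schedule(static) could be swapped for schedule(dynamic).
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < num_sms; ++i) {
            sms[static_cast<std::size_t>(i)].cycle();
        }
        // The parallel loop ends with an implicit barrier, so any shared
        // component could safely be stepped serially at this point.
    }

    // Reduce the per-SM counters serially, in a fixed order.
    std::uint64_t total = 0;
    for (const ToySM& sm : sms) {
        total += sm.issued;
    }
    return total;
}
```

Calling `simulate(8, 1000)` returns the same value regardless of the number of threads, because no thread writes to a location owned by another SM and the final reduction runs in a fixed order. With GCC or Clang, the pragma takes effect when compiling with `-fopenmp`; without that flag the pragma is ignored and the loop runs serially with the same result.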

Paper Structure

This paper contains 8 sections, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Time in seconds required to execute each workload with a single thread.
  • Figure 2: GPU design.
  • Figure 3: SM design.
  • Figure 4: Profiler output.
  • Figure 5: Speed-up with a different number of threads against the single-threaded version.
  • ...and 2 more figures