Analyzing and Improving Hardware Modeling of Accel-Sim

Rodrigo Huerta, Mojtaba Abaie Shoushtary, Antonio González

TL;DR

The paper addresses realism gaps in Accel-Sim's SM modeling by implementing a per-sub-core front-end with private L0 caches, a conflict-aware result bus that respects register-bank constraints, and a distributed memory pipeline with separate address coalescing in each sub-core. These changes are evaluated on a 42-benchmark suite using trace-driven RTX 2070 Super configurations, revealing small average speed-ups but notable per-benchmark AVC improvements, especially from the memory-pipeline enhancements. The work provides a more faithful platform for exploring GPU microarchitectural ideas and co-designs, and it outlines extensive future work inside and outside the SM to further improve modeling fidelity. Overall, the contributions advance performance-accurate GPU simulation and enable deeper investigation of hardware/software optimization opportunities in modern GPUs.

Abstract

GPU architectures have become popular for executing general-purpose programs. Their many-core architecture supports a large number of threads that run concurrently to hide the latency between dependent instructions. In modern GPU architectures, each SM/core is typically composed of several sub-cores, where each sub-core has its own independent pipeline. Simulators are a key tool for investigating novel concepts in computer architecture. They must be performance-accurate and model the target hardware faithfully in order to expose its bottlenecks. This paper presents a broad analysis of different parts of Accel-Sim, a popular GPGPU simulator, and some improvements to its model. First, we focus on the front-end and develop a more realistic model. Then, we analyze the way the result bus works and develop a more realistic one. Next, we describe the current memory pipeline model and propose a model for a more cost-effective design. Finally, we discuss other areas of improvement of the simulator.

Paper Structure (11 sections, 7 figures, 1 table)

This paper contains 11 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: SM architecture.
  • Figure 2: Current Accel-Sim front-end.
  • Figure 3: Proposed front-end.
  • Figure 4: Memory execution pipeline of Accel-Sim.
  • Figure 5: Proposed memory execution pipeline.
  • ...and 2 more figures