Multiport Support for Vortex OpenGPU Memory Hierarchy

Injae Shin, Blaise Tine

TL;DR

The paper addresses memory bandwidth bottlenecks in GPGPUs by extending the Vortex OpenGPU with a multiport memory hierarchy that spans the L1 cache through the LLC to better exploit High-Bandwidth Memory (HBM) channels. It presents three arbitration strategies and a detailed multiport implementation, evaluated with cycle-level (SimX) and RTL simulation and with hardware synthesis. Results show that IPC rises with the number of memory ports, reaching an average speedup of 2.34x with 8 memory ports in the tested configuration, with the largest gains on memory-bound workloads. The area cost is manageable (mainly additional LUTs and registers), and no arbitration scheme is a clear winner. The work demonstrates a practical way to use HBM parallelism to raise GPU memory bandwidth, informing future GPGPU designs and compiler/kernel optimization.

Abstract

Modern applications have grown in size and require more computational power. The rise of machine learning and AI has increased the need for parallel computation and, in turn, for GPGPUs. To meet this demand, GPGPUs' SIMT architectures have scaled up the number of threads and cores, raising throughput to match application needs. However, this scaling places a larger demand on memory, making memory bandwidth a bottleneck. High-Bandwidth Memory (HBM), with its larger number of memory ports, offers a potential solution: the GPU can exploit HBM's memory parallelism to increase memory bandwidth. Doing so effectively, however, presents a complex challenge for GPU architectures, namely how to distribute those ports among the streaming multiprocessors in the GPGPU. In this work, we extend the Vortex OpenGPU microarchitecture to incorporate a multiport memory hierarchy spanning from the L1 cache to the last-level cache (LLC). In addition, we propose various arbitration strategies to optimize memory transfers across the cache hierarchy. Our results show that increasing the number of memory ports increases IPC, achieving an average speedup of 2.34x with 8 memory ports in the tested configuration at a relatively small area overhead.
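
To make the port-distribution problem concrete, here is a minimal toy model. It is not the Vortex implementation, nor any of the paper's arbitration strategies. It assumes each memory port services one request per cycle and that requests are mapped to ports by interleaving cache-line addresses. The function names and the mapping are hypothetical, chosen only for illustration.

```python
from collections import Counter


def cycles_to_drain(line_addrs, num_ports):
    """Cycles needed to service all requests, one request per port per cycle."""
    if num_ports < 1:
        raise ValueError("num_ports must be at least 1")
    # Assumed mapping (illustrative only): interleave line addresses across ports.
    per_port = Counter(addr % num_ports for addr in line_addrs)
    # Ports drain in parallel, so the busiest port determines the total time.
    return max(per_port.values(), default=0)


def relative_throughput(line_addrs, num_ports):
    """Throughput relative to one port for the same non-empty request stream."""
    return cycles_to_drain(line_addrs, 1) / cycles_to_drain(line_addrs, num_ports)


if __name__ == "__main__":
    streaming = list(range(64))            # consecutive lines spread evenly over ports
    strided = [i * 8 for i in range(64)]   # stride of 8 lines always maps to port 0 here
    for ports in (1, 2, 4, 8):
        print(ports,
              relative_throughput(streaming, ports),
              relative_throughput(strided, ports))
```

In this toy model the streaming pattern scales with the port count, while the strided pattern gains nothing because every request lands on the same port. That is one intuition for why adding ports is not enough on its own and why the way requests are distributed and arbitrated matters.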

Paper Structure (15 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Vortex Multiport Microarchitecture
  • Figure 2: Arbitration Strategies
  • Figure 3: L1 Port Sharing among ICache and DCache
  • Figure 4: Raw IPC performance for SimX and RTLSim
  • Figure 5: Relative IPC performance for SimX and RTLSim relative to Mem_Ports=1