Enabling full-speed random access to the entire memory on the A100 GPU

Alden Walker

TL;DR

The paper addresses the challenge of performing truly random access across the entire A100 GPU memory despite limited TLB reach. It reverse-engineers part of the hardware layout through TLB probing to identify groups of SMs that share memory resources. The authors show that constraining each SM group to an access window of less than 64GB yields full-speed random HBM access across the entire memory. The approach is relevant to HPC workloads on A100 GPUs and highlights the role of TLB topology in memory performance tuning.

Abstract

We describe some features of the A100 memory architecture. In particular, we give a technique to reverse-engineer some hardware layout information. Using this information, we show how to avoid TLB issues to obtain full-speed random HBM access to the entire memory, as long as we constrain any particular thread to a reduced access window of less than 64GB.
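
As a rough illustration of the windowing idea, the following Python sketch splits a device address range into one contiguous window per group and draws random offsets that stay inside a group's window. The 80 GiB capacity, the choice of two groups, the 128-byte access granularity, and the function names are illustrative assumptions, not details taken from the paper; on real hardware the grouping would come from the TLB probing mentioned above.

```python
import random

GIB = 1 << 30


def assign_windows(total_bytes, num_groups, max_window_bytes=64 * GIB):
    """Split [0, total_bytes) into one contiguous window per group.

    Raises ValueError if a window would not be strictly smaller than
    max_window_bytes (the "less than 64GB" constraint from the abstract).
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    # Ceiling division so the windows together cover the whole range.
    window = -(-total_bytes // num_groups)
    if window >= max_window_bytes:
        raise ValueError("window too large; use more groups")
    windows = []
    for g in range(num_groups):
        start = g * window
        end = min(start + window, total_bytes)
        windows.append((start, end))
    return windows


def random_offset(windows, group, rng, align=128):
    """Draw a random aligned byte offset inside the given group's window."""
    start, end = windows[group]
    slots = (end - start) // align
    return start + rng.randrange(slots) * align


if __name__ == "__main__":
    # Assumed configuration: 80 GiB of memory shared by two hypothetical groups.
    windows = assign_windows(80 * GIB, num_groups=2)
    rng = random.Random(0)
    for g, (start, end) in enumerate(windows):
        off = random_offset(windows, g, rng)
        print(f"group {g}: window [{start}, {end}), sample offset {off}")
```

Each group only ever touches its own window, while the windows together cover the full address range, which is the property the abstract relies on.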

Paper Structure (10 sections, 6 figures)

Figures (6)

  • Figure 1: Memory throughput for random access.
  • Figure 2: Probing SM pairs.
  • Figure 3: Rearranging SM indices to clarify SM memory resource groupings.
  • Figure 4: Running each resource group individually.
  • Figure 5: Running pairs of resource groups.
  • ...and 1 more figure