Enabling full-speed random access to the entire memory on the A100 GPU

Alden Walker

TL;DR

The paper addresses the challenge of performing truly random access across the whole of A100 GPU memory despite limited TLB reach. It reverse-engineers part of the hardware layout by TLB probing to identify groups of SMs that share memory resources. The author shows that constraining the threads in each SM group to an access window of less than 64GB yields full-speed random HBM access across the entire memory. The approach has practical implications for HPC workloads on A100 GPUs and highlights the role of TLB topology in memory performance tuning.

Abstract

We describe some features of the A100 memory architecture. In particular, we give a technique to reverse-engineer some hardware layout information. Using this information, we show how to avoid TLB issues to obtain full-speed random HBM access to the entire memory, as long as we constrain any particular thread to a reduced access window of less than 64GB.
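
The windowing idea in the abstract can be illustrated with a small host-side sketch. This is not the paper's code: it assumes that the mapping from SMs to resource groups is already known (the paper obtains it by probing), and the names `group_windows`, `random_address`, and `sm_to_group`, as well as the group count and memory size in the usage note, are hypothetical. It only shows how a random address might be drawn so that each group stays inside a window of less than 64GB while the groups together cover all of memory.

```python
import random

GIB = 1 << 30
WINDOW_LIMIT = 64 * GIB  # window bound quoted in the abstract


def group_windows(total_bytes, num_groups):
    """Split [0, total_bytes) into one contiguous window per group.

    Assumption: an even split is good enough for illustration; the
    paper may assign windows differently.
    """
    windows = []
    for g in range(num_groups):
        lo = g * total_bytes // num_groups
        hi = (g + 1) * total_bytes // num_groups
        if hi - lo >= WINDOW_LIMIT:
            raise ValueError("window must be smaller than 64 GiB")
        windows.append((lo, hi))
    return windows


def random_address(sm_id, sm_to_group, windows, rng):
    """Draw a random byte offset inside the window of the SM's group.

    sm_to_group is a hypothetical lookup table (SM index -> group index).
    """
    lo, hi = windows[sm_to_group[sm_id]]
    return rng.randrange(lo, hi)
```

For example, with a hypothetical 80 GiB device and two groups, `group_windows(80 * GIB, 2)` gives two 40 GiB windows, and every address drawn for an SM falls inside the window of that SM's group.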

Paper Structure (10 sections, 6 figures)


Introduction
A100
TLBs
Use case and summary of results
Experiments
Basic benchmark
Determining SM resource groups
Checking for resource group independence
Getting full throughput
Conclusion

Figures (6)

Figure 1: Memory throughput for random access.
Figure 2: Probing SM pairs.
Figure 3: Rearranging SM indices to clarify SM memory resource groupings.
Figure 4: Running each resource group individually.
Figure 5: Running pairs of resource groups.
...and 1 more figures
