Enabling full-speed random access to the entire memory on the A100 GPU
Alden Walker
TL;DR
The paper addresses the challenge of random access across the full A100 GPU memory despite limited TLB reach. It reverse-engineers part of the hardware layout through TLB probing to identify groups of SMs that share memory resources. The authors show that restricting each SM group, and therefore every thread in it, to an access window of less than 64GB avoids the TLB issues and gives full-speed random HBM access across the entire memory. The approach is relevant to HPC workloads on A100 GPUs and shows how much TLB topology matters for memory performance tuning.
Abstract
We describe some features of the A100 memory architecture. In particular, we give a technique to reverse-engineer some hardware layout information. Using this information, we show how to avoid TLB issues to obtain full-speed random HBM access to the entire memory, as long as we constrain any particular thread to a reduced access window of less than 64GB.
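The windowing idea can be illustrated with a small host-side sketch. This is not the authors' implementation: the group count (2), the total memory size (80 GiB), the window size (40 GiB, chosen only because it is below the 64GB bound), the 128-byte alignment, and the function names `window_for_group` and `random_address` are all assumptions made for illustration. The sketch assigns each hypothetical SM group one contiguous window so that the windows together cover all of memory, then draws random addresses that stay inside a group's window.

```python
import random

GIB = 1 << 30  # bytes per GiB


def window_for_group(group, num_groups, total_bytes, window_bytes):
    """Return (start, end) of the byte window for one hypothetical SM group."""
    if not 0 <= group < num_groups:
        raise ValueError("group index out of range")
    if window_bytes > total_bytes:
        raise ValueError("window larger than memory")
    if num_groups == 1:
        if window_bytes < total_bytes:
            raise ValueError("a single window cannot cover all memory")
        return 0, total_bytes
    # Spread window starts evenly; windows may overlap but must not leave gaps.
    stride = (total_bytes - window_bytes) // (num_groups - 1)
    if stride > window_bytes:
        raise ValueError("windows would leave part of memory uncovered")
    if group == num_groups - 1:
        start = total_bytes - window_bytes  # pin the last window to the end
    else:
        start = group * stride
    return start, start + window_bytes


def random_address(group, num_groups, total_bytes, window_bytes, rng, align=128):
    """Pick a random aligned byte address inside the group's window."""
    start, end = window_for_group(group, num_groups, total_bytes, window_bytes)
    slots = (end - start) // align
    return start + rng.randrange(slots) * align


# Assumed example values, not measurements from the paper.
TOTAL = 80 * GIB
WINDOW = 40 * GIB
NUM_GROUPS = 2

rng = random.Random(0)
for g in range(NUM_GROUPS):
    lo, hi = window_for_group(g, NUM_GROUPS, TOTAL, WINDOW)
    addr = random_address(g, NUM_GROUPS, TOTAL, WINDOW, rng)
    print(f"group {g}: window [{lo // GIB}, {hi // GIB}) GiB, sample address {addr}")
```

In a real kernel the group index would come from the SM on which a thread runs, using the reverse-engineered layout; here it is simply a loop variable.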
