Modeling Utilization to Identify Shared-Memory Atomic Bottlenecks
Rongcui Dong, Sreepathi Pai
TL;DR
This work addresses the challenge of identifying bottlenecks in shared-memory atomics for data-dependent GPU workloads, where traditional models like Roofline are insufficient. It introduces a practical, counter-driven single-server queuing model that treats the shared-memory atomic unit as a load-dependent server, with parameters derived from microbenchmarks and hardware counters. The authors implement two tools to (i) tabulate $S(n,e,c)$ for a given GPU family and (ii) profile real programs to compute per-SM utilization $U^{(i)}$, revealing bottlenecks and shifts between atomics and global memory. A case study on image histogram kernels demonstrates up to 30% performance differences due to access patterns and architectural features like ATOMS.POPC.INC, validating the model’s ability to guide optimizations. The approach offers a practical pathway to quantify and reason about atomic contention in data-dependent workloads, with potential applicability to other GPU units given suitable counters.
Abstract
Performance analysis is critical for GPU programs with data-dependent behavior, but models like Roofline are of limited use for such programs, and interpreting raw performance counters is tedious. In this work, we present an analytical model for shared-memory atomics (\emph{fetch-and-op} and \emph{compare-and-swap} instructions on NVIDIA Volta and Ampere GPUs) that allows users to immediately determine whether shared-memory atomic operations are a bottleneck for a program's execution. Our model treats the architecture as a single-server queuing system whose inputs are performance counters. It captures load-dependent behavior such as pipelining, parallelism, and different access patterns. We embody this model in a tool that uses CUDA hardware counters as parameters to predict the utilization of the shared-memory atomic unit. To the best of our knowledge, no existing profiling tool or model provides this capability for shared-memory atomic operations. We used the model to compare two histogram kernels that use shared-memory atomics. Although the kernels are nearly identical, their performance can differ by up to 30\%. Our tool correctly identifies a bottleneck shift from the shared-memory atomic unit as the cause of this discrepancy.
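To make the idea of a counter-driven utilization estimate concrete, the following is a minimal sketch of the generic utilization law for a single server (busy time divided by observed time). It is not the authors' formula or tool: the function names, the table layout, and all numbers are hypothetical, and the `(n, e, c)` tuples are treated as opaque lookup keys whose meaning is defined in the paper, not here.

```python
from typing import Dict, List, Tuple

# Opaque key for a load/access class. Assumption: mirrors the arguments of
# S(n, e, c) in the paper, but this sketch assigns them no meaning.
Key = Tuple[int, int, int]


def utilization(service_cycles: Dict[Key, float],
                op_counts: Dict[Key, int],
                elapsed_cycles: float) -> float:
    """Utilization law for one server: total busy cycles / elapsed cycles.

    service_cycles: hypothetical microbenchmark table, cycles per operation
                    for each load class.
    op_counts:      hypothetical counter-derived number of operations
                    observed in each load class.
    """
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    busy = sum(count * service_cycles[key] for key, count in op_counts.items())
    return busy / elapsed_cycles


def per_sm_utilization(service_cycles: Dict[Key, float],
                       per_sm_counts: List[Dict[Key, int]],
                       elapsed_cycles: float) -> List[float]:
    """Apply the same law independently to each SM's counters."""
    return [utilization(service_cycles, counts, elapsed_cycles)
            for counts in per_sm_counts]


# Made-up numbers, for illustration only.
service = {(1, 1, 1): 4.0, (32, 1, 1): 2.0}
sm_counts = [
    {(1, 1, 1): 100, (32, 1, 1): 200},  # SM 0: 400 + 400 busy cycles
    {(1, 1, 1): 50},                    # SM 1: 200 busy cycles
]
u = per_sm_utilization(service, sm_counts, elapsed_cycles=1000.0)
```

Under these made-up inputs, a value close to 1 for an SM would suggest that the modeled unit is busy for most of the run and is a candidate bottleneck, while a low value would point elsewhere. The actual model is load-dependent in ways this sketch does not capture.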
