Table of Contents
Fetching ...

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

Soham Chakraborty, S. Krishna, Andreas Pavlogiannis, Omkar Tuppe

TL;DR

The paper tackles correctness verification for GPU programs under weak memory concurrency by introducing GPUMC, a stateless model checker tailored to the scoped-RC11 ($SRC11$) model. It extends dynamic partial order reduction (DPOR) to GPUs, providing exploration, barrier-aware scheduling, and reads-from management while guaranteeing soundness, completeness, and optimality with polynomial space. A key feature is automatic repair of certain heterogeneous races, enabling transformation of pathological programs into race-free versions. Empirical evaluation shows GPUMC scales to larger GPU programs, detects races that other tools miss, and outperforms bounded-model-checking approaches like Dartagnan in both time and memory, offering a practical pathway to reliable GPU software engineering under weak memory models.

Abstract

GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional properties like divergence, and thereby follow different weak memory consistency models. These features and properties make concurrent programming on GPUs more complex and error-prone. To this end, we present GPUMC, a stateless model checker to check the correctness of GPU shared-memory concurrent programs under scoped-RC11 weak memory concurrency model. GPUMC explores all possible executions in GPU programs to reveal various errors - races, barrier divergence, and assertion violations. In addition, GPUMC also automatically repairs these errors in the appropriate cases. We evaluate GPUMC with benchmarks and real-life GPU programs. GPUMC is efficient both in time and memory in verifying large GPU programs where state-of-the-art tools are timed out. In addition, GPUMC identifies all known errors in these benchmarks compared to the state-of-the-art tools.

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

TL;DR

The paper tackles correctness verification for GPU programs under weak memory concurrency by introducing GPUMC, a stateless model checker tailored to the scoped-RC11 () model. It extends dynamic partial order reduction (DPOR) to GPUs, providing exploration, barrier-aware scheduling, and reads-from management while guaranteeing soundness, completeness, and optimality with polynomial space. A key feature is automatic repair of certain heterogeneous races, enabling transformation of pathological programs into race-free versions. Empirical evaluation shows GPUMC scales to larger GPU programs, detects races that other tools miss, and outperforms bounded-model-checking approaches like Dartagnan in both time and memory, offering a practical pathway to reliable GPU software engineering under weak memory models.

Abstract

GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional properties like divergence, and thereby follow different weak memory consistency models. These features and properties make concurrent programming on GPUs more complex and error-prone. To this end, we present GPUMC, a stateless model checker to check the correctness of GPU shared-memory concurrent programs under scoped-RC11 weak memory concurrency model. GPUMC explores all possible executions in GPU programs to reveal various errors - races, barrier divergence, and assertion violations. In addition, GPUMC also automatically repairs these errors in the appropriate cases. We evaluate GPUMC with benchmarks and real-life GPU programs. GPUMC is efficient both in time and memory in verifying large GPU programs where state-of-the-art tools are timed out. In addition, GPUMC identifies all known errors in these benchmarks compared to the state-of-the-art tools.

Paper Structure

This paper contains 25 sections, 9 theorems, 8 equations, 11 figures, 9 tables, 5 algorithms.

Key Result

theorem 1

The DPOR algorithm for $\mathsf{SRC11}$ is sound, complete and optimal.

Figures (11)

  • Figure 1: Example of GPU concurrency errors. In (a), we have two threads $T_1, T_2$ from the CTAs $\mathsf{cta}_1, \mathsf{cta}_2$. In (b) all threads are in the same CTA.
  • Figure 2: Executions shown in (b) and (c) are independent of whether $i=j$ or not. (b) shows an execution where $Y$ reads 0 from the initial location. (c) shows an execution where $Y$ and $X$ read 1 in $\mathsf{T}_2$. (d) shows an execution where $Y$ reads 1 from $\mathsf{T}_1$ but cannot synchronize, as $\mathsf{T}_1$ and $\mathsf{T}_2$ are in different CTAs ($i\neq j$). If $i\neq j$, $X$ may read 0 from initialization. (e) is a special case of execution shown in (c) where $i = j$. If $i==j$, then read and write on $Y$ are in synchronization relation because these accesses on $Y$ are scope-inclusive. (f) shows an execution where there is a synchronization on $Y$ with an inclusion relation (so again $i=j$). Hence, $X$ in $\mathsf{T}_2$ cannot read value 0 from initialization, as it violates the coherence axiom; consequently, the execution is forbidden.
  • Figure 3: $\mathsf{SRC11}$ relations and axioms with some violation patterns.
  • Figure 4: \ref{['namegmem']} Scalability: $\mathsf{GPUMC}$ execution time and the memory consumed to detect the heterogeneous race for 1dconv and GCON and the assertion violation in SB. The x-axis shows the total number of threads for GCON and SB, and CTAs for 1dconv. The y-axis measures the memory in megabytes (MB) and the time in seconds.
  • Figure 5: Checks of $\textsc{CheckOptimal}$ : \ref{['eisread']}. In the execution, we use the simpler notations $rd(x,v), wt(x,v)$ since the scope and access is clear from the program.
  • ...and 6 more figures

Theorems & Definitions (17)

  • theorem 1
  • lemma 1
  • proof
  • lemma 2
  • proof
  • lemma 3
  • proof
  • lemma 4
  • proof
  • lemma 5
  • ...and 7 more