HiRace: Accurate and Fast Source-Level Race Checking of GPU Programs

John Jacobson, Martin Burtscher, Ganesh Gopalakrishnan

TL;DR

HiRace addresses GPU data races by instrumenting CUDA source code and performing dynamic analysis. It uses a fixed-size per-address shadow state encoded by a finite-state machine with $25$ states and $1200$ transitions, requiring only $5$ bits per address and about $8$ bytes total per shadow entry. The approach is validated against the Indigo benchmark suite and verified with the Murphi model checker to ensure correctness. Empirical results show HiRace detects more races than prior tools, with up to $30$-$50$x speedups and roughly half the memory overhead. This combination yields a practical, scalable GPU race detector that runs at source level and does not depend on compiler/hardware specifics, and it will be open-sourced.
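To make the size figures concrete, here is a minimal sketch of how a 5-bit state id could share one 8-byte word with other per-address metadata. The source states only the 5-bit state and the roughly 8-byte entry; the names `ShadowEntry`, `set_state`, `get_state`, `set_meta`, and `get_meta`, and the idea of using the remaining 59 bits as a single metadata field, are assumptions made for illustration and are not HiRace's actual layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 8-byte shadow entry. Only "5 bits of state" and "about 8 bytes
// per entry" come from the text; the rest of the layout is an assumption.
constexpr unsigned kStateBits = 5;  // 2^5 = 32 codes, enough for 25 states
constexpr std::uint64_t kStateMask = (1ull << kStateBits) - 1;

struct ShadowEntry {
    std::uint64_t bits = 0;  // one fixed-size word per tracked address
};

// Store an FSM state id in the low 5 bits; the upper 59 bits are preserved.
inline void set_state(ShadowEntry& e, unsigned state) {
    assert(state <= kStateMask);
    e.bits = (e.bits & ~kStateMask) | state;
}

inline unsigned get_state(const ShadowEntry& e) {
    return static_cast<unsigned>(e.bits & kStateMask);
}

// Store illustrative metadata in the upper 59 bits; the state is preserved.
inline void set_meta(ShadowEntry& e, std::uint64_t meta) {
    assert(meta < (1ull << (64 - kStateBits)));
    e.bits = (e.bits & kStateMask) | (meta << kStateBits);
}

inline std::uint64_t get_meta(const ShadowEntry& e) {
    return e.bits >> kStateBits;
}
```

Because the entry never grows, the shadow memory needed is proportional to the number of tracked addresses rather than to the number of accesses.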

Abstract

Data races are egregious parallel programming bugs on CPUs. They are even worse on GPUs due to the hierarchical thread and memory structure, which makes it possible to write code that is correctly synchronized within a thread group while not being correct across groups. Thus far, all major data-race checkers for GPUs suffer from at least one of the following problems: they do not check races in global memory, do not work on recent GPUs, scale poorly, have not been extensively tested, miss simple data races, or are not dependable without detailed knowledge of the compiler. Our new data-race detection tool, HiRace, overcomes these limitations. Its key novelty is an innovative parallel finite-state machine that condenses an arbitrarily long access history into a constant-length state, thus allowing it to handle large and long-running programs. HiRace is a dynamic tool that checks for thread-group shared memory and global device memory races. It utilizes source-code instrumentation, thus avoiding driver, compiler, and hardware dependencies. We evaluate it on a modern calibrated data-race benchmark suite. On the 580 tested CUDA kernels, 346 of which contain data races, HiRace finds races missed by other tools without false alarms and is more than 10 times faster on average than the current state of the art, while incurring only half the memory overhead.
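The idea of condensing an arbitrarily long access history into a constant-length state can be illustrated with a deliberately simplified, sequential toy. The five states below, the `Shadow` struct, and `on_access` are hypothetical: they ignore barriers, atomics, and the block/warp hierarchy, and they are not the parallel state machine the paper describes. The sketch assumes every pair of accesses by different threads is unsynchronized.

```cpp
#include <cstdint>

// Toy states: a summary of everything that has happened to one address so far.
enum class State : std::uint8_t {
    Unaccessed, ReadByOne, ReadByMany, WrittenByOne, Race
};
enum class Op { Read, Write };

struct Shadow {
    State state = State::Unaccessed;
    std::uint32_t owner = 0;  // thread id; meaningful in ReadByOne/WrittenByOne
};

// Fold one more access into the summary. Cost and storage are constant,
// no matter how many accesses came before.
inline void on_access(Shadow& s, Op op, std::uint32_t tid) {
    switch (s.state) {
    case State::Unaccessed:
        s.state = (op == Op::Read) ? State::ReadByOne : State::WrittenByOne;
        s.owner = tid;
        break;
    case State::ReadByOne:
        if (tid == s.owner) {
            if (op == Op::Write) s.state = State::WrittenByOne;
        } else {
            // A second reader is harmless; a write after another thread's
            // read is a conflict.
            s.state = (op == Op::Read) ? State::ReadByMany : State::Race;
        }
        break;
    case State::ReadByMany:
        // Any write conflicts with a read by some other thread.
        if (op == Op::Write) s.state = State::Race;
        break;
    case State::WrittenByOne:
        // Any access by a different thread conflicts with the earlier write.
        if (tid != s.owner) s.state = State::Race;
        break;
    case State::Race:
        break;  // absorbing: once a race is recorded, it stays recorded
    }
}
```

A thousand reads by several threads leave the entry in `ReadByMany`, exactly as two such reads would, which is why the history never needs to be stored explicitly.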

Paper Structure (18 sections, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: State machine for code without barriers; any unmentioned outgoing transition from a state means we stay in the same state (a "self loop")
  • Figure 2: Shadow information used by HiRace
  • Figure 3: Grid of CUDA threads assumed in Listing 1 (for which we ignore B0 and B1) and Listing 2 (for which we consider B0 and B1)
  • Figure 4: State machine for code with block synchronization; any unmentioned outgoing transition from a state means we stay in the same state
  • Figure 5: HiRace speedup vs. iGuard. Each point represents execution of one Indigo benchmark kernel on the associated input graph.
  • ...and 1 more figure