Control Flow Management in Modern GPUs

Mojtaba Abaie Shoushtary; Jordi Tubella Murgadas; Antonio Gonzalez

TL;DR

This work tackles the opacity of control-flow management in NVIDIA GPUs by deriving plausible semantics for the Turing native ISA and introducing Hanoi, a lightweight control-flow mechanism. Hanoi uses a dual-stack microarchitecture with WS and REC stacks, plus Bx/Rx registers and simple predicates, to reproduce realistic reconvergence and interleaving behavior while remaining hardware-efficient. Through extensive binary/trace analysis and a validation checker, the authors show Hanoi closely matches real hardware traces and incurs negligible IPC difference on most benchmarks, with small hardware overhead. The study enables accurate performance modeling and research beyond PTX-based abstractions, offering practical guidance for reconvergence strategies and deadlock avoidance in modern GPUs.

Abstract

In GPUs, the control flow management mechanism determines which threads in a warp are active at any point in time. It tracks the control flow of the scalar threads within a warp to optimize thread scheduling, and it plays a critical role in the utilization of execution resources. Software can control or assist the mechanism through instructions. However, GPU vendors do not disclose details about their compilers, ISAs, or hardware implementations. This lack of transparency makes it hard for researchers to understand how the mechanism functions, how it is implemented, and how software assists it, which matters whenever it significantly affects their research. It is also a problem for GPU performance modeling: one can only rely on control flow traces captured from real hardware and cannot model or modify the mechanism itself. This paper addresses the issue by defining plausible semantics for the control flow instructions of the Turing native ISA, based on insights from experimental data gathered with various benchmarks. Building on these definitions, we propose Hanoi, a low-cost mechanism for efficient control flow management. Hanoi ensures correctness and generates control flow that is very close to that of real hardware. Our evaluation shows that the control flow traces of real hardware and of our mechanism differ by only 1.03% on average. Furthermore, the Instructions Per Cycle (IPC) of a GPU employing Hanoi differs from that of a GPU using the native control flow management of actual hardware by just 0.19% on average.
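To make "which threads in a warp are active" concrete, the following is a minimal sketch of a classic stack-based reconvergence scheme, of the kind the outline below groups under pre-Volta control flow management. It is an illustration only: it is not Hanoi and not NVIDIA's hardware, and the toy opcodes (`op`, `bra`, `jmp`, `exit`), the `PROGRAM` table, and the `simulate` function are all hypothetical names introduced here.

```python
# Toy warp simulator with a reconvergence stack (illustrative assumption,
# not the paper's Hanoi mechanism and not a real ISA).
# Hypothetical program: A; if cond goto C-path else B-path; reconverge at D.
PROGRAM = {
    0: ("op", "A"),
    1: ("bra", 4, 5),  # threads with cond true jump to pc 4; paths rejoin at pc 5
    2: ("op", "B"),
    3: ("jmp", 5),
    4: ("op", "C"),
    5: ("op", "D"),
    6: ("exit",),
}


def simulate(program, num_threads, cond):
    """Return a trace of (instruction name, active mask) pairs for one warp."""
    full_mask = (1 << num_threads) - 1
    # Each stack entry is [pc, active_mask, reconvergence_pc].
    stack = [[0, full_mask, None]]
    trace = []
    while stack:
        top = stack[-1]
        pc, mask, rpc = top
        if pc == rpc:
            # This path reached its reconvergence point: resume the entry below.
            stack.pop()
            continue
        instr = program[pc]
        kind = instr[0]
        if kind == "exit":
            stack.pop()
        elif kind == "op":
            trace.append((instr[1], mask))
            top[0] = pc + 1
        elif kind == "jmp":
            top[0] = instr[1]
        elif kind == "bra":
            target, reconv = instr[1], instr[2]
            taken = sum(
                1 << t for t in range(num_threads) if (mask >> t) & 1 and cond(t)
            )
            not_taken = mask & ~taken
            if taken and not_taken:
                # Divergence: the current entry waits at the reconvergence pc
                # while each side of the branch runs with its own mask.
                top[0] = reconv
                stack.append([pc + 1, not_taken, reconv])
                stack.append([target, taken, reconv])
            elif taken:
                top[0] = target
            else:
                top[0] = pc + 1
    return trace


if __name__ == "__main__":
    # Even-numbered threads take the branch in a 4-thread warp.
    for name, mask in simulate(PROGRAM, 4, lambda tid: tid % 2 == 0):
        print(name, format(mask, "04b"))
```

In this sketch the two sides of the branch execute one after the other with complementary masks, and the full mask is restored at the reconvergence point. Serializing the paths this way is what allows the SIMT-induced deadlocks listed in the outline, for example when one path spins on a lock held by threads on the other path.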

Paper Structure (25 sections, 10 figures, 3 tables)


Introduction
Pre-Volta Control Flow Management
SIMT-Induced Deadlocks in Pre-Volta
Post-Volta Control Flow Management
Turing Control Flow Instructions
Predicated Control Flow Instructions
EXIT Instruction
BRA Instruction
CALL and RET Instructions
BMOV, BSSY, BSYNC, and BREAK Instructions
WARPSYNC Instruction
YIELD Instruction
Practical Applications of Turing Control Flow Instructions
Reconvergence after Nested Branches
Reconvergence Earlier than IPDom
...and 10 more sections

Figures (10)

Figure 1: Control flow management for a pre-Volta GPU model with 4 threads in a warp: (a) Code sample with branch divergence [independent_tsch], (b) Divergent threads execution model [CudaGuide, independent_tsch], and (c) Plausible control flow management implementation [RPUDWF, GPGPU_SIM_Manual]
Figure 2: GPU's SIMT core microarchitecture [GPGPU_SIM_Manual]
Figure 3: Spinlock CUDA Implementation
Figure 4: Control flow management for a post-Volta GPU model with 4 threads in a warp: (a) Code sample with branch divergence [independent_tsch], (b) Plausible execution on post-Volta GPUs [independent_tsch]
Figure 5: Sample of reconvergence after nested branches
...and 5 more figures
