RTop-K: Ultra-Fast Row-Wise Top-K Selection for Neural Network Acceleration on GPUs

Xi Xie; Yuebo Luo; Hongwu Peng; Caiwen Ding

TL;DR

This work targets the challenge of efficient row-wise top-$k$ selection on GPUs for neural network workloads, especially MaxK-GNNs. It introduces RTop-K, a binary search-based top-$k$ algorithm that operates per row with optional early stopping to balance speed and accuracy, and it provides theoretical and empirical analysis of the iteration behavior. The GPU kernel design features a three-stage pipeline (load, search, select) and leverages warp-level primitives to minimize memory traffic while producing exactly (or approximately) the top-$k$ elements per row. Empirical results show substantial end-to-end gains, with average kernel speedups up to $11.49\times$ (and $7.29\times$ without early stopping) against PyTorch, and overall MaxK-GNN training speedups from roughly $12\%$ to $33\%$ across multiple models and datasets, while maintaining robust accuracy under early stopping.
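
The per-row binary search with optional early stopping described above can be sketched in plain Python. This is only an illustrative CPU sketch under stated assumptions, not the authors' GPU kernel: the function names `rtopk_row` and `rtopk_rows` are hypothetical, the parameters `max_iter` and `eps` mirror the $max\_iter$ and $\epsilon$ that appear in the Figure 4 caption, and the fill step used when the search stops early is one plausible way to return exactly $k$ indices.

```python
def rtopk_row(row, k, max_iter=None, eps=1e-16):
    """Return indices of (approximately) the k largest entries of one row.

    Hypothetical helper: binary-searches a threshold between min(row) and
    max(row) until exactly k entries lie at or above it, or until the
    search is stopped early (max_iter) or the interval collapses (eps).
    """
    if not 0 < k <= len(row):
        raise ValueError("k must satisfy 0 < k <= len(row)")
    lo, hi = min(row), max(row)
    it = 0
    while hi - lo > eps and (max_iter is None or it < max_iter):
        mid = (lo + hi) / 2
        if mid <= lo or mid >= hi:
            break  # floating-point resolution reached, no further progress
        cnt = sum(1 for x in row if x >= mid)
        if cnt == k:
            lo = hi = mid  # exact threshold found
            break
        if cnt > k:
            lo = mid  # threshold too low: too many entries pass
        else:
            hi = mid  # threshold too high: too few entries pass
        it += 1
    # Select stage: entries at or above the upper threshold come first.
    picked = [i for i, x in enumerate(row) if x >= hi][:k]
    # Fill stage (assumption): top up from the undecided band [lo, hi).
    if len(picked) < k:
        for i, x in enumerate(row):
            if lo <= x < hi:
                picked.append(i)
                if len(picked) == k:
                    break
    return picked


def rtopk_rows(matrix, k, max_iter=None, eps=1e-16):
    """Apply rtopk_row independently to every row of a list-of-lists matrix."""
    return [rtopk_row(row, k, max_iter, eps) for row in matrix]
```

With `max_iter=None` the search runs until the threshold interval collapses, so the selected values match an exact top-$k$ up to ties. A small `max_iter` trades exactness for fewer passes over the row, which is the speed and accuracy trade-off the TL;DR refers to.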

Abstract

Top-k selection algorithms are fundamental in a wide range of applications, including high-performance computing, information retrieval, big data processing, and neural network model training. In this paper, we present RTop-K, a highly efficient parallel row-wise top-k selection algorithm specifically designed for GPUs. RTop-K leverages a binary search-based approach to optimize row-wise top-k selection, providing a scalable and accelerated solution. We conduct a detailed analysis of early stopping in our algorithm, showing that it effectively maintains the testing accuracy of neural network models while substantially improving performance. Our GPU implementation of RTop-K demonstrates superior performance over state-of-the-art row-wise top-k GPU implementations, achieving an average speed-up of up to 11.49$\times$ with early stopping and 7.29$\times$ without early stopping. Moreover, RTop-K accelerates the overall training workflow of MaxK-GNNs, delivering speed-ups ranging from 11.97% to 33.29% across different models and datasets.

Paper Structure (16 sections, 6 equations, 9 figures, 5 tables, 2 algorithms)

Introduction
Preliminary and Related Works
Top-k Algorithms
GPU Architecture
GPU Top-k Implementations
RTop-K Framework
Binary Search-based Top-k Selection Algorithm
GPU Implementation Design
Experiments
Setup and Configuration
RTop-K Kernel Evaluation
Model Training and Testing Performance Evaluation
Conclusion
Acknowledgments
The expectation of the iteration counts of the binary search-based top-k selection algorithm
...and 1 more sections

Figures (9)

Figure 1: The core operation of MaxK-GNN, which introduces row-wise top-k selection into the GNN workflow to provide non-linearity and acceleration.
Figure 2: Illustration of the binary search-based top-k selection algorithm.
Figure 3: GPU implementation of the binary search-based top-k selection algorithm.
Figure 4: Comparison of kernel execution time (ms) between RTop-K with different early stopping $max\_iter$ values and without early stopping ($\epsilon = 10^{-16}$), against PyTorch for various configurations of ($N$, $M$, $k$), where $N = 2^{14}, 2^{16}, 2^{18}, 2^{20}$, $M = 256, 512, 768$, and $k = 16, 32, 64, 96, 128$. The average speedup of the no early stopping version for each $(N, M)$ setting is indicated in the title of each subplot.
Figure 5: Overall training speed-up ratio and testing accuracy of applying RTop-K to various MaxK-GNN model training processes on different graphs. Setting: $N=\text{\#Nodes}$, $M=256$, $k=32$.
...and 4 more figures
