AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention

Aleksandar Stankovic

TL;DR

AutoSAGE addresses the throughput variability of CSR SpMM/SDDMM in GPU-based GNN workloads by selecting tiling and mapping per input: a lightweight estimate is refined by on-device micro-probes, a guardrail falls back to vendor kernels to avoid regressions, and a persistent cache makes the decisions replayable. It covers SpMM, SDDMM, and CSR attention through hub-aware tiling, and the cache enables deterministic replay across device, graph signature, feature width $F$, and operation. On Reddit and OGBN-Products it matches vendor baselines at bandwidth-bound widths and finds gains at small $F$; on synthetic sparsity and skew stress tests it reaches up to $4.7\times$ kernel-level speedups. The work releases CUDA sources, Python bindings, a reproducible harness, and replayable cache logs for evaluation.

Abstract

Sparse GNN aggregations (CSR SpMM/SDDMM) vary widely in performance with degree skew, feature width, and GPU micro-architecture. We present AutoSAGE, an input-aware CUDA scheduler that chooses tiling and mapping per input using a lightweight estimate refined by on-device micro-probes, with a guardrail that safely falls back to vendor kernels and a persistent cache for deterministic replay. AutoSAGE covers SpMM and SDDMM and composes into a CSR attention pipeline (SDDMM $\to$ row-softmax $\to$ SpMM). On Reddit and OGBN-Products, it matches vendor baselines at bandwidth-bound feature widths and finds gains at small widths; on synthetic sparsity and skew stress tests it achieves up to $4.7\times$ kernel-level speedups. We release CUDA sources, Python bindings, a reproducible harness, and replayable cache logs.
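
To make the composition SDDMM, row-softmax, SpMM concrete, the following is a minimal CPU reference sketch in NumPy. It is not the AutoSAGE CUDA implementation: the function names (`sddmm_csr`, `row_softmax_csr`, `spmm_csr`, `csr_attention`) and the dot-product scoring are assumptions chosen for illustration, and the loops favor readability over speed.

```python
import numpy as np

def sddmm_csr(indptr, indices, Q, K):
    """Score each stored edge (i, j) as dot(Q[i], K[j]), in CSR order."""
    scores = np.empty(len(indices), dtype=Q.dtype)
    for i in range(len(indptr) - 1):
        lo, hi = indptr[i], indptr[i + 1]
        scores[lo:hi] = K[indices[lo:hi]] @ Q[i]
    return scores

def row_softmax_csr(indptr, scores):
    """Softmax over the stored entries of each row; empty rows are skipped."""
    out = np.empty_like(scores)
    for i in range(len(indptr) - 1):
        lo, hi = indptr[i], indptr[i + 1]
        if hi > lo:
            # Subtract the row maximum for numerical stability.
            e = np.exp(scores[lo:hi] - scores[lo:hi].max())
            out[lo:hi] = e / e.sum()
    return out

def spmm_csr(indptr, indices, vals, V):
    """Aggregate neighbor features: out[i] = sum_j vals[i, j] * V[j]."""
    out = np.zeros((len(indptr) - 1, V.shape[1]), dtype=V.dtype)
    for i in range(len(indptr) - 1):
        lo, hi = indptr[i], indptr[i + 1]
        out[i] = vals[lo:hi] @ V[indices[lo:hi]]
    return out

def csr_attention(indptr, indices, Q, K, V):
    """Compose the three stages on one CSR sparsity pattern."""
    scores = sddmm_csr(indptr, indices, Q, K)
    weights = row_softmax_csr(indptr, scores)
    return spmm_csr(indptr, indices, weights, V)
```

All three stages share the same `indptr`/`indices` arrays, so the edge scores and attention weights never need a dense $N \times N$ matrix. A row with no stored entries yields a zero output row in this sketch.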

Paper Structure

This paper contains 32 sections, 1 theorem, 7 figures, 10 tables.

Key Result

Proposition 1

With guardrail threshold $\alpha\le 1$, the chosen runtime satisfies $t_{\text{chosen}}\le t_b$, where $t_b$ is the baseline runtime. Consequently, AutoSAGE does not regress versus the baseline under identical input and device.
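
The reasoning behind the proposition can be sketched in a few lines of Python. The acceptance rule used here (keep a candidate only if its probed time is at most $\alpha \cdot t_b$) is an assumption about how the guardrail is applied, and the names `choose_with_guardrail`, `cached_choice`, and the candidate labels are hypothetical. Under that rule, any accepted candidate has $t \le \alpha t_b \le t_b$, and otherwise the baseline itself runs, so $t_{\text{chosen}} \le t_b$ on the probed timings.

```python
def choose_with_guardrail(t_baseline, candidate_times, alpha=0.98):
    """Return (name, time) of the schedule to run.

    Assumed rule: a candidate is accepted only if its probed time is
    at most alpha * t_baseline; otherwise the baseline is kept.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    best_name, best_time = "baseline", t_baseline
    for name, t in candidate_times.items():
        if t <= alpha * t_baseline and t < best_time:
            best_name, best_time = name, t
    return best_name, best_time

# Hypothetical in-memory stand-in for the persistent cache.
_cache = {}

def cached_choice(device, graph_signature, feature_width, op,
                  t_baseline, candidate_times, alpha=0.98):
    """Probe once per (device, graph signature, F, op), then replay."""
    key = (device, graph_signature, feature_width, op)
    if key not in _cache:
        _cache[key] = choose_with_guardrail(
            t_baseline, candidate_times, alpha)[0]
    return _cache[key]
```

For example, with `t_baseline = 1.0` and `alpha = 0.98`, a candidate probed at `0.99` is rejected and the baseline is kept, while one probed at `0.5` is accepted. Replaying from the cache returns the stored decision without consulting new timings, which is what makes repeated runs deterministic in this sketch.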

Figures (7)

  • Figure 1: Speedup vs. $F$ on Products.
  • Figure 2: Products: wide $F$ sweep.
  • Figure 3: Reddit: guardrail $=0.98$.
  • Figure 4: Reddit: guardrail $=0.95$.
  • Figure 5: Reddit: wide $F$ sweep.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Proposition 1: Non-regression under guardrail
  • proof