A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, Olli Saarikivi

TL;DR

This work tackles GPU under-utilization during inference, which arises when dependent tile-based kernels must run one after another. It presents cuSync, a framework that synchronizes dependent tiles rather than whole kernels using streams, semaphores, and policies, together with cuSyncGen, which auto-generates tile orders and synchronization policies. Across GPT-3, LLaMA, ResNet-38, and VGG-19, cuSync reduces inference time by up to 15%, 14%, 22%, and 16% respectively. The approach requires modest software changes and applies beyond GeMM to Conv2D and other tile-based ML kernels.

Abstract

Machine Learning (ML) models execute several parallel computations, including Generalized Matrix Multiplication, Convolution, and Dropout. These computations are commonly executed on Graphics Processing Units (GPUs) by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles is usually higher than the number of execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes.
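The wave argument in the abstract can be made concrete with a small, idealized model. The sketch below is not cuSync itself; it only counts time steps under stated assumptions: every tile takes one time unit, each consumer tile depends on exactly one producer tile with the same index, and a waiting tile does not occupy an execution unit. The function names (`wave_stats`, `kernel_level_time`, `tile_level_time`) and the tile counts in the usage example are hypothetical and chosen only for illustration.

```python
import math


def wave_stats(num_tiles: int, num_units: int) -> tuple[int, float]:
    """Return (number of waves, fraction of units busy in the final wave)."""
    if num_tiles < 1 or num_units < 1:
        raise ValueError("num_tiles and num_units must be positive")
    waves = math.ceil(num_tiles / num_units)
    last_wave_tiles = num_tiles - (waves - 1) * num_units
    return waves, last_wave_tiles / num_units


def kernel_level_time(num_tiles: int, num_units: int) -> int:
    """Time steps for two dependent kernels when the second kernel
    starts only after every wave of the first has finished."""
    waves, _ = wave_stats(num_tiles, num_units)
    return 2 * waves


def tile_level_time(num_tiles: int, num_units: int) -> int:
    """Time steps when a consumer tile may start as soon as its own
    producer tile finished in an earlier step (assumed 1:1 dependency)."""
    if num_tiles < 1 or num_units < 1:
        raise ValueError("num_tiles and num_units must be positive")
    pending_producer = list(range(num_tiles))
    pending_consumer = list(range(num_tiles))
    done_producer: set[int] = set()
    steps = 0
    while pending_producer or pending_consumer:
        # Producer tiles get the execution units first.
        run_p = pending_producer[:num_units]
        pending_producer = pending_producer[len(run_p):]
        free_units = num_units - len(run_p)
        # Leftover units run consumer tiles whose input is already complete.
        run_c = [t for t in pending_consumer if t in done_producer][:free_units]
        pending_consumer = [t for t in pending_consumer if t not in run_c]
        # Producer tiles launched in this step become visible in the next one.
        done_producer.update(run_p)
        steps += 1
    return steps


if __name__ == "__main__":
    # Hypothetical example: 6 tiles per kernel on 4 execution units.
    print(wave_stats(6, 4))
    print(kernel_level_time(6, 4), tile_level_time(6, 4))
```

In this model, 6 tiles on 4 units need two waves with the final wave half empty, so two dependent kernels take 4 steps with kernel-level synchronization but 3 steps when consumer tiles fill the idle units. When the tile count is a multiple of the unit count (for example 8 tiles on 4 units), both schedules take the same number of steps, which matches the abstract's point that the gain comes from partially filled final waves.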

Paper Structure (35 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Thread block execution with existing stream synchronization and fine-grained synchronization on 4 SMs for two dependent GeMM kernels: $C_{12\times 8} = A_{12\times 8} \times B_{8\times 8}$ and $E_{12\times 8} = C_{12\times 8} \times B_{8\times 8}$.
  • Figure 2: Architecture of the Multi-Layer Perceptron (MLP) and Attention layers of GPT-3, where H is 12288. Model parallelism on 8 GPUs divides the weight matrices of both layers among the GPUs. Both layers take an input matrix X of shape $\left[\texttt{B}, \texttt{S}, \texttt{H}\right]$ and produce the result XW$_{12}$ of the same shape. B is the number of batched requests, S is the sequence length, H is the hidden dimension, and S' is the sum of processed and generated tokens.
  • Figure 3: The LLaMA MLP contains three weight matrices. With model parallelism on 8 GPUs, these matrices are: W$_1$ of shape $\left[\texttt{H}, \frac{\texttt{H}}{3}\right]$, V of shape $\left[\texttt{H}, \frac{\texttt{H}}{3}\right]$, and W$_2$ of shape $\left[\frac{\texttt{H}}{3}, \texttt{H}\right]$.
  • Figure 4: Fine-grained synchronization of two GeMMs of MLP using cuSync's TileSync and RowSync policies.
  • Figure 5: Dependencies in the cuSyncGen DSL. TileM and TileN are the tile sizes of the GeMMs along the row and column dimensions, respectively.
  • ...and 3 more figures