Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs

Kun Wu

TL;DR

This dissertation targets data inefficiency in deep learning training on GPUs, specifically excess data movement and limited GPU memory capacity. It introduces three approaches: Hector, a two-level IR and code-generation framework for relational graph neural networks (RGNNs) that minimizes data movement and memory footprint; PyTorch-Direct, a GPU-centric data-access paradigm that enables zero-copy host-memory access for GNN training via a unified tensor type; and SSDTrain, a framework that offloads activations to NVMe SSDs to overcome GPU memory limits in large-scale LLM training while preserving throughput. Empirical results show speedups of up to 43.7x in RGNN training and up to 9.9x in inference, and reductions in peak activation memory of up to 47%, along with new design tools and evaluation methods (e.g., the recompute-offload-keep curve). Together, these contributions show that carefully designed code generation and runtime techniques can systematically mitigate data-management bottlenecks in data-intensive DL workloads and enable scalable training on existing hardware stacks.

Abstract

As deep learning models scale, their training cost has surged significantly. Due to both hardware advancements and limitations in current software stacks, the need for data efficiency has risen. Data efficiency refers to the effective hiding of data access latency and the avoidance of unnecessary data movements. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limitations, and inefficiencies in the PyTorch software stack, including a lack of device-specific PCIe transfer optimizations and high-level domain-specific abstractions. To effectively mitigate these data inefficiencies for deep learning training, this dissertation analyzes data inefficiency in representative deep learning training tasks, specifically in graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and code generation techniques to mitigate these challenges and implements these optimizations seamlessly within the PyTorch stack while maintaining strong programmability and interoperability. First, PyTorch-Direct is devised to incorporate the GPU-centric PCIe data transfer paradigm in PyTorch for GNN training. Next, the Hector intermediate representation (IR) and its code generator are proposed to introduce domain-specific high-level abstraction and systematically address memory-intensive performance challenges for relational GNNs. Finally, in LLM training, throughput is increasingly constrained by GPU memory capacity. To mitigate this, the SSDTrain offloading framework is designed and implemented. Together, these contributions show that code generation and runtime techniques can systematically mitigate the data management bottlenecks in deep learning training, which stem from the data-intensive nature of workloads and the oversimplification inherent in the deep learning training software stack.

Paper Structure

This paper contains 99 sections, 4 equations, 41 figures, 17 tables, and 4 algorithms.

Figures (41)

  • Figure 1: Trend of recent GPUs for deep learning. We collect the inter-device (D2D) bandwidth, PCIe bandwidth, memory bandwidth, and floating-point throughput of Nvidia 100-level GPUs since Kepler and Google TPUs [epochParameterComputeData, techpowerupGPUSpecsDatabase2024, jouppiTPUV4, timothyprickettmorganLotsQuestionsGoogle2024, smithNVIDIABlackwellArchitecture2024, TensorProcessingUnit2017, jouppiInDatacenterPerformanceAnalysis2017].
  • Figure 2: Comparison of the memory bandwidth and FP16 throughput of Nvidia B100 SXM [smithNVIDIABlackwellArchitecture2024] with the arithmetic intensity of Google internal production workloads [jouppiTPUV4].
  • Figure 3: The trend of enterprise SSD sequential write bandwidth [techpowerupEnterpriseSSDDatabase2024]. For each SSD model, only the data of the variant with maximal capacity is collected. Red lines show the growth rates predicted by quantile regression. The visualization code is adapted from Derek Jones's work [derekjonesShapeCodeMemory2020].
  • Figure 4: Hierarchical breakdown of the GPT model. In training, dropout is applied to the output of each layer with red borders.
  • Figure 5: Structure of a streaming multiprocessor (SM) in an Nvidia Volta V100 GPU [V100Whitepaper, jiaDissectingNVIDIAVolta2018, nickollsInstructionsManagingParallel2019, davidm.koppelmanEE7722GPU2023, nvidiaKernelProfilingGuide2024]. The execution units include FP64 units, FP32 units, arithmetic logic units (ALUs), tensor cores (TCs), transcendental and data type conversion units (XUs), and load-store units (LSUs).
  • ...and 36 more figures