GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Cong Guo, Rui Zhang, Jiale Xu, Jingwen Leng, Zihan Liu, Ziyu Huang, Minyi Guo, Hao Wu, Shouren Zhao, Junping Zhao, Ke Zhang

TL;DR

GMLake introduces a low-level GPU memory management framework that uses virtual memory stitching (VMS) to fuse non-contiguous memory into unified blocks, addressing fragmentation caused by irregular memory requests in large-scale DNN training. Built atop CUDA's VM APIs, GMLake provides a virtual memory API, a two-tier memory pool (pPool and sPool), and an allocator with Alloc/Split/Stitch operations, achieving near-fragmentation-free memory management while remaining transparent to ML models and memory-reduction techniques. Empirical results show average memory savings of $9.2$ GB (up to $25$ GB) and fragmentation reductions of $15\%$ (up to $33\%$) across eight LLMs on A100-80GB, and strong scalability across memory-efficient strategies, multi-GPU scale-out, and multiple platforms, with end-to-end throughput comparable to the baseline. The work demonstrates practical impact for training larger models or increasing batch sizes by leveraging non-contiguous memory mapping and stitching to optimize GPU memory utilization with minimal overhead.
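To make the stitching idea concrete, here is a minimal Python sketch of how non-contiguous physical blocks could be presented as one contiguous virtual range. It is a toy model, not GMLake's implementation: the names `PhysBlock`, `VirtualRange`, and `stitch` are hypothetical, sizes are in arbitrary units, and the greedy largest-first selection is an assumption made for illustration rather than the allocator policy described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysBlock:
    """A free physical memory block (hypothetical toy type)."""
    block_id: int
    size: int  # arbitrary units


class VirtualRange:
    """A contiguous virtual range backed by possibly non-contiguous blocks."""

    def __init__(self, blocks):
        self.mapping = []  # list of (virtual_offset, PhysBlock)
        offset = 0
        for block in blocks:
            self.mapping.append((offset, block))
            offset += block.size
        self.size = offset

    def translate(self, vaddr):
        """Map a virtual offset to (physical block id, offset inside block)."""
        for start, block in self.mapping:
            if start <= vaddr < start + block.size:
                return block.block_id, vaddr - start
        raise IndexError("virtual address out of range")


def stitch(free_blocks, request):
    """Greedily pick free blocks (largest first) until the request is covered.

    Returns a VirtualRange, or None if the free blocks cannot cover the request.
    The greedy policy is an illustrative assumption.
    """
    chosen, total = [], 0
    for block in sorted(free_blocks, key=lambda b: b.size, reverse=True):
        if total >= request:
            break
        chosen.append(block)
        total += block.size
    if total < request:
        return None
    return VirtualRange(chosen)


# Three scattered free blocks; no single one can hold a request of 10 units.
free = [PhysBlock(0, 4), PhysBlock(1, 2), PhysBlock(2, 6)]
stitched = stitch(free, 10)
```

In this toy model the caller sees a single range of size 10, while `translate` shows that different offsets land in different physical blocks. On real hardware the equivalent mapping would be set up through the CUDA virtual memory management APIs that the summary mentions.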

Abstract

Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, and the memory capacity of a single acceleration device such as a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., $10\times$) of GPUs' native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly under popular memory reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that these techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation in the splitting-based caching allocator. To mitigate this fragmentation problem, we propose a novel memory allocation framework based on low-level GPU virtual memory management called GPU memory lake (GMLake). GMLake employs a novel virtual memory stitching (VMS) mechanism, which fuses non-contiguous memory blocks through virtual memory address mapping. GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by 15% (up to 33%) across eight LLMs on an A100 GPU with 80 GB of memory. GMLake is completely transparent to the DNN models and memory reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machine-learning/glake/tree/main/GMLake.
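The fragmentation problem that the abstract attributes to splitting-based caching allocators can be reproduced in a few lines. The sketch below is a simplified model under stated assumptions, not PyTorch's or TensorFlow's actual allocator: `SplittingPool` is a hypothetical class that does best-fit allocation with splitting and merges adjacent free blocks on release, over a pool of 10 arbitrary units.

```python
class SplittingPool:
    """Toy splitting-based memory pool (hypothetical, for illustration only)."""

    def __init__(self, capacity):
        self.free = [(0, capacity)]  # sorted list of (start, size)
        self.used = {}               # name -> (start, size)

    def alloc(self, name, size):
        """Best fit: take the smallest free block that fits, split off the rest."""
        candidates = [blk for blk in self.free if blk[1] >= size]
        if not candidates:
            return False
        start, blk_size = min(candidates, key=lambda blk: blk[1])
        self.free.remove((start, blk_size))
        if blk_size > size:
            self.free.append((start + size, blk_size - size))
        self.free.sort()
        self.used[name] = (start, size)
        return True

    def release(self, name):
        """Return a block to the free list and merge adjacent free blocks."""
        self.free.append(self.used.pop(name))
        self.free.sort()
        merged = []
        for start, size in self.free:
            if merged and merged[-1][0] + merged[-1][1] == start:
                merged[-1] = (merged[-1][0], merged[-1][1] + size)
            else:
                merged.append((start, size))
        self.free = merged

    def total_free(self):
        return sum(size for _, size in self.free)

    def largest_free(self):
        return max((size for _, size in self.free), default=0)


pool = SplittingPool(10)
for name, size in [("A", 3), ("B", 2), ("C", 3), ("D", 2)]:
    pool.alloc(name, size)

# An irregular release pattern leaves two free holes that are not adjacent.
pool.release("A")
pool.release("C")

total = pool.total_free()        # free units overall
largest = pool.largest_free()    # largest contiguous free block
fits = pool.alloc("E", 5)        # fails although enough memory is free in total
```

Here six units are free in total, but the largest contiguous hole holds only three, so a request for five units fails. This gap between total free memory and the largest contiguous block is what stitching non-contiguous blocks into one virtual range is meant to close.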

Paper Structure (47 sections, 1 equation, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Representative example of the memory allocation problem. The original splitting method boosts GPU memory utilization but causes fragmentation. Our proposed virtual memory stitching complements it and mitigates the resulting fragmentation.
  • Figure 2: Three memory management strategies.
  • Figure 3: Memory utilization with five method combinations.
  • Figure 4: Memory utilization with different GPU numbers.
  • Figure 5: Memory footprint of GPT-NeoX-20B training.
  • ...and 10 more figures