MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

Yong-Cheng Liaw, Shuo-Han Chen

TL;DR

MemAscend tackles a largely overlooked bottleneck in SSD-offloaded LLM fine-tuning: system memory fragmentation and overhead. It introduces four components: an adaptive buffer pool, alignment-free pinned memory allocation, a fused overflow check, and a Direct NVMe engine. Together they reclaim substantial system memory and enable larger models, longer contexts, and higher throughput on modest hardware. The approach integrates with existing memory optimizations such as ZeRO-Infinity, Liger-Kernel, FlashAttention, and offloaded gradient checkpointing, and it reduces peak system memory by 55.7% on average while also improving I/O and latency. For resource-constrained settings, this lowers the hardware barrier for full-parameter fine-tuning and makes large-scale training more cost-effective.
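To illustrate why a fused overflow check can lower peak memory, the sketch below contrasts a multi-step check, which builds several full-size temporaries, with a single-pass check that builds none. It is a pure-Python illustration under stated assumptions, not MemAscend's kernel or the ZeRO-Infinity code path; the function names and the flat list of gradient values are hypothetical.

```python
import math

def unfused_overflow_check(grads):
    """Multi-step check: every step materializes a temporary as large as grads."""
    is_inf = [math.isinf(g) for g in grads]            # full-size temporary 1
    is_nan = [math.isnan(g) for g in grads]            # full-size temporary 2
    flags = [a or b for a, b in zip(is_inf, is_nan)]   # full-size temporary 3
    return any(flags)

def fused_overflow_check(grads):
    """Single pass: tests each value once and keeps no full-size temporary.

    The generator yields one flag at a time, and any() stops at the first
    non-finite value it sees.
    """
    return any(not math.isfinite(g) for g in grads)

# Hypothetical gradient values, for illustration only.
clean = [0.5, -1.25, 3.0]
overflowed = [0.5, float("inf"), 3.0]
```

Both functions return the same answer; the difference is that the fused version's extra memory does not grow with the number of gradient values.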

Abstract

Owing to the success of generative artificial intelligence (AI), large language models (LLMs) have emerged as a core subclass, underpinning applications such as question answering, text generation, and code completion. While fine-tuning these models on domain-specific data can yield significant performance gains, it also poses daunting computational challenges, especially for researchers and small organizations with limited hardware resources. Although SSD offloading (e.g., ZeRO-Infinity) has emerged as a viable strategy for overcoming the GPU memory barrier by leveraging both system memory (i.e., CPU DRAM) and storage space (i.e., solid-state drives, SSDs), its design primarily targets model-centric performance issues. As a result, key system-level issues, including system memory fragmentation, inefficient pinned buffer allocation, peak CPU usage spikes, and file system overhead, remain unaddressed, stifling scalability and inflating costs. This observation motivates MemAscend, a framework that systematically tackles the underexplored system memory bottlenecks in SSD-offloaded LLM training, with a focus on resource-constrained environments. By streamlining pinned-memory allocation, eliminating fragmentation, and mitigating peak overhead, MemAscend reclaims a substantial system memory budget, enabling larger models, longer context windows, and larger batch sizes without exceeding modest hardware limits. Across diverse LLM benchmarks, MemAscend reduces peak system-memory consumption by an average of 55.7% compared with standard SSD offloading techniques, lowering the hardware barrier for fine-tuning and unlocking new possibilities for cost-effective large-scale training on limited-resource machines.
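One way pinned buffer allocation can become inefficient is size rounding. The sketch below assumes a hypothetical allocator that rounds every request up to the next power of two, and computes how many bytes are wasted as padding. The rounding policy, the function names, and the example buffer size are all illustrative assumptions; the abstract does not specify the allocator's behavior.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def padding_waste(requested_bytes: int) -> int:
    """Bytes lost to padding under the assumed power-of-two rounding policy."""
    return next_power_of_two(requested_bytes) - requested_bytes

MIB = 2**20
GIB = 2**30

# Hypothetical pinned buffer slightly larger than 2 GiB.
requested = 2 * GIB + 100 * MIB
wasted = padding_waste(requested)
print(f"requested {requested // MIB} MiB, wasted {wasted // MIB} MiB")
```

Under this assumption, a request just above a power-of-two boundary nearly doubles its footprint, which is the kind of overhead an alignment-free allocation scheme would avoid.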

Paper Structure

This paper contains 46 sections, 1 equation, 21 figures, 6 tables, 1 algorithm.

Figures (21)

  • Figure 1: ZeRO-Infinity data flow during the backward pass for an $n$-layer model with three data-parallel ranks. Here, "network" refers to collective communication operations, such as AllGather or ReduceScatter, which synchronize data (i.e., tensors) across multiple GPUs during distributed training. The parameters P0(i), the first tensors for the i-th GPU, are moved to each GPU. The gradients G0(i), which correspond to P0(i), are processed and offloaded. The optimizer states O0(i), which correspond to P0(i) and G0(i), are updated on the CPU.
  • Figure 2: GPU memory usage at short and long context lengths when training an 8-billion-parameter model with a batch size of 4, under different GPU memory optimizations. The y-axis uses a base-10 logarithmic scale. GC denotes gradient checkpointing, Liger/Flash denotes Liger-Kernel and FlashAttention, and Offloaded-GC denotes offloaded gradient checkpointing; each label indicates that the optimization is enabled.
  • Figure 3: Tensor lifetimes during gradient overflow checks in ZeRO-Infinity.
  • Figure 4: System memory overhead in the original SSD offloading system across different models. This overhead limits the trainable model size, context length, and batch size.
  • Figure 5: System architecture of MemAscend, highlighting optimized low-level kernels and memory management for reduced memory usage and enhanced performance.
  • ...and 16 more figures