Analysis and Optimized CXL-Attached Memory Allocation for Long-Context LLM Fine-Tuning
Yong-Cheng Liaw, Shuo-Han Chen
TL;DR
This paper tackles the memory bottlenecks of long-context LLM fine-tuning on GPU-constrained systems by using Compute Express Link (CXL) attached memory as an extension to system RAM. It introduces a fine-grained PyTorch memory allocation extension for per-tensor placement and a CXL-aware memory allocator that uses a latency-first greedy strategy to partition tensors between local DRAM and CXL devices, placing latency-sensitive data in DRAM and bandwidth-bound data in CXL. Empirical results on 7B and 12B models with 4K-32K contexts show that the proposed approach recovers 97-99% of DRAM-only throughput with a single CXL add-in card (AIC) and approximately 100% with two AICs, outperforming naive interleaving by up to 21%. The work demonstrates that carefully managed CXL memory can scale long-context fine-tuning beyond DRAM limits, providing a practical path to high-performance training on heterogeneous memory systems.
Abstract
The substantial memory requirements of Large Language Models (LLMs), particularly for long-context fine-tuning, have renewed interest in CPU offloading to augment limited GPU memory. However, as context lengths grow, relying on CPU memory for intermediate states becomes a bottleneck of its own, because the offloaded footprint can exhaust the memory capacity of mainstream client platforms. To address this limitation, this work investigates the effectiveness of Compute Express Link (CXL) add-in card (AIC) memory as an extension to CPU memory, enabling larger model sizes and longer context lengths during fine-tuning. Extensive benchmarking reveals two critical challenges. First, current deep learning frameworks such as PyTorch lack fine-grained, per-tensor control over NUMA memory allocation, exposing only coarse, process-level policies. Second, because of this lack of control, when the memory footprint of fine-tuning is spread across local DRAM and CXL-attached memory, naively placing optimizer data in higher-latency CXL memory leads to substantial slowdowns in the optimizer step (e.g., 4x once data exceeds 20M elements). To overcome these challenges, this work introduces a PyTorch extension that enables tensor-level system memory control and a CXL-aware memory allocator that pins latency-critical tensors in local DRAM while maximizing bandwidth by striping latency-tolerant tensors across one or more CXL devices. Evaluated on a real hardware setup with 7B and 12B models, 4K-32K contexts, and a single GPU, the proposed approach recovers 97-99% of DRAM-only throughput with a single AIC and approximately 100% with two AICs, delivering up to 21% improvement over naive interleaving while preserving DRAM-like DMA bandwidth for GPU transfers. These results show that carefully managed CXL-attached memory is a practical path to scaling long-context fine-tuning beyond DRAM limits.
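The placement policy summarized above (latency-critical tensors in local DRAM, latency-tolerant tensors striped across CXL devices) can be illustrated with a minimal planning sketch. It is not the paper's allocator: the names `TensorInfo`, `plan_placement`, and `stripe`, the boolean `latency_sensitive` label, the byte-even striping, and the rule that a latency-sensitive tensor spills to CXL when DRAM is full are all assumptions made for illustration. The sketch only computes a plan and does not call any real NUMA or PyTorch API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical descriptor for one host-resident tensor."""
    name: str
    size_bytes: int
    latency_sensitive: bool  # assumed label, e.g. True for optimizer state


def stripe(size_bytes, nodes):
    """Split a tensor as evenly as possible across the given memory nodes.

    Real striping would be page-granular; byte-even shares are a simplification.
    """
    base, extra = divmod(size_bytes, len(nodes))
    return [(node, base + (1 if i < extra else 0)) for i, node in enumerate(nodes)]


def plan_placement(tensors, dram_capacity, cxl_nodes):
    """Latency-first greedy plan: {tensor name: [(node, bytes), ...]}.

    Latency-sensitive tensors are visited first and pinned in DRAM while they
    fit. Everything else, plus any latency-sensitive tensor that no longer
    fits (an assumed fallback), is striped across the CXL nodes.
    """
    if not cxl_nodes:
        raise ValueError("at least one CXL node is required")
    plan = {}
    dram_free = dram_capacity
    # sorted() is stable, so tensors keep their input order within each group.
    for t in sorted(tensors, key=lambda t: not t.latency_sensitive):
        if t.latency_sensitive and t.size_bytes <= dram_free:
            plan[t.name] = [("dram", t.size_bytes)]
            dram_free -= t.size_bytes
        else:
            plan[t.name] = stripe(t.size_bytes, cxl_nodes)
    return plan


if __name__ == "__main__":
    tensors = [
        TensorInfo("activations", 100, False),
        TensorInfo("optimizer_state", 60, True),
        TensorInfo("master_weights", 50, True),
    ]
    for name, parts in plan_placement(tensors, 80, ["cxl0", "cxl1"]).items():
        print(name, parts)
```

In this toy input the optimizer state fits in DRAM, while the second latency-sensitive tensor no longer fits and is striped across both CXL nodes together with the activations. A real allocator would bind the resulting regions to NUMA nodes through the operating system's memory policy interfaces, which this sketch deliberately leaves out.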
