Analysis and Optimized CXL-Attached Memory Allocation for Long-Context LLM Fine-Tuning

Yong-Cheng Liaw, Shuo-Han Chen

TL;DR

This paper tackles the memory bottlenecks of long-context LLM fine-tuning on GPU-constrained systems by using CXL-attached memory as an extension to system RAM. It introduces a fine-grained PyTorch memory allocation extension for per-tensor placement and a CXL-aware memory allocator that uses a latency-first greedy strategy to partition tensors between local DRAM and CXL devices, placing latency-sensitive data in DRAM and bandwidth-bound data in CXL. Empirical results on 7B and 12B models with 4K–32K contexts show that the proposed approach recovers 97–99% of DRAM-only throughput with a single CXL add-in card (AIC) and ~100% with two AICs, outperforming naive interleaving by up to 21%. The work demonstrates that carefully managed CXL memory can scale long-context fine-tuning beyond DRAM limits, providing a practical path to high-performance training on heterogeneous memory systems.

Abstract

The substantial memory requirements of Large Language Models (LLMs), particularly for long-context fine-tuning, have renewed interest in CPU offloading to augment limited GPU memory. However, as context lengths grow, relying on CPU memory for intermediate states introduces a significant bottleneck that can exhaust the capacity of mainstream client platforms. To address this limitation, this work investigates the effectiveness of Compute Express Link (CXL) add-in card (AIC) memory as an extension to CPU memory, enabling larger model sizes and longer context lengths during fine-tuning. Extensive benchmarking reveals two critical challenges. First, current deep learning frameworks such as PyTorch lack fine-grained, per-tensor control over NUMA memory allocation, exposing only coarse, process-level policies. Second, because of this lack of control, when the memory footprint of fine-tuning is spread across local DRAM and CXL-attached memory, naively placing optimizer data in higher-latency CXL memory leads to substantial slowdowns in the optimizer step (e.g., 4x once data exceeds 20M elements). To overcome these challenges, this work introduces a PyTorch extension that enables tensor-level system memory control and a CXL-aware memory allocator that pins latency-critical tensors in local DRAM while maximizing bandwidth by striping latency-tolerant tensors across one or more CXL devices. Evaluated on a real hardware setup with 7B and 12B models, 4K–32K contexts, and a single GPU, our approach recovers 97–99% of DRAM-only throughput with a single AIC and approximately 100% with two AICs, delivering up to 21% improvement over naive interleaving while preserving DRAM-like DMA bandwidth for GPU transfers. These results show that carefully managed CXL-attached memory is a practical path to scaling long-context fine-tuning beyond DRAM limits.
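
The placement policy described above (latency-critical tensors pinned in local DRAM, latency-tolerant tensors striped across CXL devices) can be illustrated with a minimal Python sketch. This is not the authors' allocator: the `TensorInfo` record, the `latency_sensitive` flag, the `place_tensors` function, and the device names are all hypothetical, and the sketch ignores CXL capacity limits and real NUMA binding. It only shows the ordering idea behind a latency-first greedy split.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    """Hypothetical description of one host-side tensor."""
    name: str
    size_bytes: int
    latency_sensitive: bool  # assumption: supplied by profiling or by the caller


def place_tensors(tensors, dram_capacity, cxl_devices):
    """Latency-first greedy placement (illustrative only).

    Latency-sensitive tensors claim local DRAM first, largest first.
    Everything else, plus any latency-sensitive tensor that no longer
    fits, is striped evenly across the given CXL devices.
    Returns {tensor name: {location: bytes}}.
    """
    if not cxl_devices:
        raise ValueError("at least one CXL device is required in this sketch")

    placement = {}
    dram_free = dram_capacity
    # Sort so latency-sensitive tensors come first, larger ones first within each class.
    ordered = sorted(tensors, key=lambda t: (not t.latency_sensitive, -t.size_bytes))

    for t in ordered:
        if t.latency_sensitive and t.size_bytes <= dram_free:
            placement[t.name] = {"dram": t.size_bytes}
            dram_free -= t.size_bytes
        else:
            # Stripe across all CXL devices; the first `rem` devices take one extra byte.
            base, rem = divmod(t.size_bytes, len(cxl_devices))
            placement[t.name] = {
                dev: base + (1 if i < rem else 0)
                for i, dev in enumerate(cxl_devices)
            }
    return placement


if __name__ == "__main__":
    demo = [
        TensorInfo("optimizer_state", 100, True),
        TensorInfo("checkpointed_activations", 301, False),
    ]
    print(place_tensors(demo, dram_capacity=128, cxl_devices=["cxl0", "cxl1"]))
```

Striping the latency-tolerant tensors over every available device mirrors the abstract's point that aggregate bandwidth grows with the number of AICs, while keeping optimizer data in DRAM avoids the slowdown reported for the CPU optimizer step.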

Paper Structure

This paper contains 25 sections, 15 figures, 1 table, 1 algorithm.

Figures (15)

  • Figure 1: Example of long-context CPU offloading with activation checkpointing for a transformer model composed of 4 transformer blocks. Arrows indicate data transfers over PCIe. $P_{i}$ represents the model parameters (e.g., attention projection parameters, feed-forward network parameters) of block $i$, $A_{i}$ represents the checkpointed input activations of the block, and $G_{i}$ represents the gradients corresponding to the parameters of the block. The numbered steps illustrate the data movement and computation flow.
  • Figure 2: System memory requirement scaling for the 12B model across varying context lengths with a batch size of 5.
  • Figure 3: Throughput and system memory requirement scaling for the 12B model across batch sizes with a 4K context length.
  • Figure 4: Comparison of memory access data paths and latencies between local memory and CXL-attached memory.
  • Figure 5: Latency of the CPU-based Adam optimizer step with a growing number of parameters.
  • ...and 10 more figures