Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference

Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari

TL;DR

Endor tackles the weight-transfer bottleneck of offloaded LLM inference by compressing unstructured sparse weights with a bitmap-based sparse format, significantly reducing weight transfer volume while keeping decompression overhead low. The method is compatible with existing pruning and, to a degree, quantization, enabling joint optimization that further lowers latency. Empirical results on OPT-66B and Llama2-70B show substantial speedups (up to 2.37x) over dense offloading, especially when using direct SSD–GPU transfer via GPUDirect Storage. This approach has practical implications for running large models on constrained hardware, enabling more layers to reside in CPU memory or on SSD without substantially sacrificing throughput or accuracy.

Abstract

The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of gigabytes in size. Offloading is a popular method to escape this constraint by storing the weights of an LLM in host CPU memory and on SSD, then loading each weight to the GPU before every use. In our case study of offloaded inference, we found that, due to the low bandwidth between storage devices and the GPU, the latency of transferring large model weights from their offloaded location to GPU memory becomes the critical bottleneck, with actual compute taking nearly 0% of runtime. To effectively reduce the weight transfer latency, we propose Endor, a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with a high compression ratio and low decompression overhead. Endor achieves this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Hugging Face Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves a 2.25x speedup on OPT-66B and a 2.37x speedup on Llama2-70B.
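
The bitmap idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which targets GPU-side decompression. The function names (`compress`, `decompress`, `compressed_fraction`) are hypothetical, and the 16-bit value width and 50% sparsity used in the example are assumptions chosen for illustration. The sketch stores one bit per element to mark non-zero positions, keeps only the non-zero values, and estimates the resulting size relative to dense storage.

```python
import numpy as np

def compress(weight):
    """Split a pruned weight matrix into a packed bitmap and its non-zero values."""
    flat = weight.ravel()                 # row-major 1-D view of the matrix
    mask = flat != 0                      # True where an element survived pruning
    bitmap = np.packbits(mask)            # 1 bit per element, packed into uint8
    values = flat[mask]                   # non-zero values, in row-major order
    return bitmap, values, weight.shape

def decompress(bitmap, values, shape):
    """Rebuild the dense matrix by scattering values to the positions set in the bitmap."""
    n = int(np.prod(shape))
    mask = np.unpackbits(bitmap, count=n).astype(bool)
    dense = np.zeros(n, dtype=values.dtype)
    dense[mask] = values
    return dense.reshape(shape)

def compressed_fraction(sparsity, bits_per_value=16):
    """Size of (bitmap + non-zero values) relative to dense storage.

    Assumes one bitmap bit per element and bits_per_value bits per stored value.
    """
    return (1.0 - sparsity) + 1.0 / bits_per_value

# Illustrative example: a small fp16 matrix with roughly half of its entries pruned.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float16)
w[rng.random(w.shape) < 0.5] = 0

bitmap, values, shape = compress(w)
restored = decompress(bitmap, values, shape)
assert np.array_equal(restored, w)        # the round trip is lossless
```

Under these assumptions, `compressed_fraction(0.5)` evaluates to 0.5625: at 50% sparsity with 16-bit values, the bitmap adds 1/16 of the dense size on top of the remaining half of the values. The actual transfer savings reported in the paper depend on the sparsity and data types used there.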

Paper Structure

This paper contains 27 sections, 16 figures, 2 algorithms.

Figures (16)

  • Figure 1: Memory/SSD configuration including capacity and measured bandwidth.
  • Figure 2: Execution time comparison of offloaded OPT layers.
  • Figure 3: Timeline of an SSD-mapped OPT layer.
  • Figure 4: Sparse format for offloaded model weights. Bitmap is stored as a 1-d vector.
  • Figure 5: Timeline of offloaded execution of a fully-connected operation.
  • ...and 11 more figures