Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM

Lian Liu, Shixin Zhao, Bing Li, Haimeng Ren, Zhaohui Xu, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang

TL;DR

This work tackles the high cost of deploying billion-parameter LLMs on server-grade GPUs by introducing Hermes, a budget-friendly inference system that augments a consumer GPU with NDP-DIMMs to expand memory capacity and provide near-data computation. It leverages the intrinsic activation sparsity of LLMs to partition work into hot neurons processed on the GPU and cold neurons offloaded to NDP-DIMMs, guided by a lightweight online predictor and balanced by a window-based scheduling mechanism. Offline ILP-based neuron mapping provides an initial optimal placement, while online adaptations using token-wise similarity and layer-wise correlation maintain high throughput during inference. Compared with state-of-the-art offloading systems, Hermes delivers substantial speedups on models like LLaMA2-70B at a fraction of the cost, demonstrating practical viability for local deployment and cost-effective LLM serving on commodity hardware.

Abstract

Billion-scale Large Language Models (LLMs) typically have to be deployed on expensive server-grade GPUs with high-capacity HBM and abundant computation capability. As LLM-assisted services become popular, cost-effective LLM inference on budget-friendly hardware is increasingly in demand. Extensive prior research relocates LLM parameters from expensive GPUs to host memory. However, the restricted bandwidth between host and GPU memory limits inference performance. This work introduces Hermes, a budget-friendly system that leverages near-data processing (NDP) within commodity DRAM DIMMs to enhance the performance of a single consumer-grade GPU, achieving efficient LLM inference. The inherent activation sparsity in LLMs naturally divides weight parameters into two categories, termed ``hot'' and ``cold'' neurons. Hot neurons, which make up only approximately 20\% of all weight parameters, account for 80\% of the total computational load, while cold neurons make up the other 80\% of parameters but are responsible for just 20\% of the computational load. We therefore propose a heterogeneous computing strategy: hot neurons are mapped to a single computation-efficient GPU, while cold neurons are offloaded to NDP-DIMMs, which offer large memory capacity but limited computation capability. Because activation sparsity is dynamic, the hot/cold partition must be updated in real time and cold neurons must be adaptively remapped across multiple NDP-DIMM modules. To this end, we introduce a lightweight predictor that optimizes real-time neuron partitioning and adjustment between the GPU and NDP-DIMMs. We also use a window-based online scheduling mechanism to maintain load balance among NDP-DIMM modules. Hermes enables the deployment of LLaMA2-70B on consumer-grade hardware at 13.75 tokens/s and achieves an average 75.24$\times$ speedup over the state-of-the-art offloading-based inference system.
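
The hot/cold split described in the abstract can be illustrated with a small sketch. It is not the Hermes implementation: the function names, the synthetic activation counts, and the 20\% threshold as a fixed parameter are all assumptions made for illustration. In particular, the placement step uses a simple greedy heuristic (largest item first, onto the least-loaded module) as a stand-in; it is not the ILP-based mapping or the window-based scheduler of the paper.

```python
def partition_neurons(activation_counts, hot_fraction=0.2):
    """Split neuron indices into (hot, cold) by activation frequency.

    Hypothetical helper: the most frequently activated `hot_fraction`
    of neurons are treated as hot (GPU), the rest as cold (NDP-DIMM).
    """
    order = sorted(range(len(activation_counts)),
                   key=lambda i: activation_counts[i], reverse=True)
    n_hot = max(1, int(len(order) * hot_fraction))
    return order[:n_hot], order[n_hot:]


def assign_cold_to_dimms(cold, activation_counts, num_dimms):
    """Greedy stand-in for load balancing (NOT the paper's ILP):
    place each cold neuron, heaviest first, on the least-loaded DIMM."""
    loads = [0] * num_dimms
    placement = [[] for _ in range(num_dimms)]
    for i in sorted(cold, key=lambda i: activation_counts[i], reverse=True):
        d = loads.index(min(loads))          # least-loaded module so far
        placement[d].append(i)
        loads[d] += activation_counts[i]
    return placement, loads


# Synthetic, heavily skewed activation counts (an assumption, not measured data).
counts = [1000 // (i + 1) for i in range(100)]
hot, cold = partition_neurons(counts)
hot_share = sum(counts[i] for i in hot) / sum(counts)
placement, loads = assign_cold_to_dimms(cold, counts, num_dimms=4)
print(f"hot neurons: {len(hot)}, share of activations: {hot_share:.2f}")
print(f"per-DIMM cold load: {loads}")
```

With a skewed activation distribution, a small hot set captures most of the activations, which is the property that makes it worthwhile to keep those neurons on the GPU. The greedy placement only keeps the per-module loads within one neuron's weight of each other; an online scheduler would additionally have to react as the activation pattern shifts between tokens.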

Paper Structure

This paper contains 41 sections, 5 equations, 17 figures, 2 tables, 1 algorithm.

Figures (17)

  • Figure 1: (a) Existing offloading solutions view host memory as the augmented memory, but cause burdensome data transfer on PCIe. (b) Partitioning the weight matrix in each layer, and utilizing NDP-DIMMs to handle poor computation intensity parts, only introduces negligible data transfer.
  • Figure 2: The LLM inference procedure and architecture.
  • Figure 3: The inherent activation sparsity within certain LLMs is further enhanced to achieve higher sparsity across various LLMs.
  • Figure 4: Distribution patterns for activation sparsity. (a) The adjacent tokens enjoy high similarity on activated neurons for various models and datasets. (b) The activated neurons between consecutive layers are highly correlated.
  • Figure 5: Overview of our proposed Hermes System. (a) Hermes augments GPU memory with NDP-DIMMs, and utilizes a scheduler to control the inference workflow. (b) Multiple NDP-DIMMs are connected to support LLM inference and inter-DIMM communication.
  • ...and 12 more figures