HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hoshik Kim

TL;DR

The paper tackles the memory-bound nature of attention in Transformer-based LLM inference by introducing the High-bandwidth Processing Unit (HPU), a memory-centric co-processor that offloads KV-cache attention to free GPU compute capacity. Implemented as a PCIe FPGA-based prototype with HBM, the HPU enables a GPU-HPU heterogeneous system that scales throughput and energy efficiency without increasing the number of GPUs. Across Llama 2 7B workloads, the GPU-HPU system achieves up to 4.1x throughput and 4.6x energy efficiency improvements, with MFU gains up to ~44% on mid-range GPUs. This approach offers a practical path to scalable, cost-effective LLM inference by balancing memory-bound and compute-bound phases through specialized hardware acceleration.

Abstract

The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Also, the HPU, as an add-on card, scales out to accommodate surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we show the HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
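
The "low operational intensity" of attention can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the paper; it assumes a single decode step, FP16 operands, a hidden size of 4096 (as in Llama 2 7B), a hypothetical sequence length of 2048, and it ignores softmax and caching effects. The function names are illustrative only. Under these assumptions, the FLOPs-per-byte ratio of a linear layer grows with batch size because the weights are shared across requests, while that of decode-time attention stays near 1 because every request reads its own KV cache.

```python
def gemm_intensity(batch, d_model=4096, bytes_per_elem=2):
    """FLOPs per byte for one (batch x d) @ (d x d) linear layer.

    Assumption: weights, inputs, and outputs are each moved once.
    """
    flops = 2 * batch * d_model * d_model
    bytes_moved = bytes_per_elem * (d_model * d_model + 2 * batch * d_model)
    return flops / bytes_moved


def attention_intensity(batch, seq_len=2048, d_model=4096, bytes_per_elem=2):
    """FLOPs per byte for decode-time attention over a per-request KV cache.

    Assumption: each request computes q.K^T and p.V over seq_len cached
    tokens (summed over all heads), and reads K, V, q and writes the output.
    """
    flops = batch * 4 * seq_len * d_model
    bytes_moved = batch * bytes_per_elem * (2 * seq_len * d_model + 2 * d_model)
    return flops / bytes_moved


for batch in (1, 8, 64, 256):
    print(
        f"batch={batch:4d}  "
        f"linear: {gemm_intensity(batch):7.2f} FLOPs/byte  "
        f"attention: {attention_intensity(batch):5.2f} FLOPs/byte"
    )
```

In this simplified model, batching moves the linear layers toward the compute-bound region of a roofline plot while attention stays memory-bound regardless of batch size, which is the imbalance the HPU is designed to address.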

Paper Structure

This paper contains 17 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: LLM inference characteristics. (a) A single Transformer layer architecture of Llama 2. In multi-batch scenarios, computations are composed of GEMM-based linear layers and GEMV-based attention layers. (b) Roofline of the A100 GPU and the trend of GEMM and GEMV characteristics as batch size increases. (c) GPU utilization from the perspective of MFU (Model FLOPS Utilization) and MBU (Model Bandwidth Utilization).
  • Figure 2: GPU-HPU heterogeneous system with SW stack.
  • Figure 5: GPU-HPU pipelined timing diagram.
  • Figure 6: Parallelism and data placement policy to optimize performance in multi-HPU environments.
  • Figure 7: HPU Prototype execution pipeline.
  • ...and 4 more figures