HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hoshik Kim
TL;DR
The paper tackles the memory-bound nature of attention in Transformer-based LLM inference by introducing the High-bandwidth Processing Unit (HPU), a memory-centric co-processor that offloads KV-cache attention to free GPU compute capacity. Implemented as a PCIe FPGA-based prototype with HBM, the HPU enables a GPU-HPU heterogeneous system that scales throughput and energy efficiency without increasing the number of GPUs. On Llama 2 7B workloads, the GPU-HPU system achieves up to 4.1x higher throughput and 4.6x better energy efficiency, with model FLOPs utilization (MFU) gains of up to ~44% on mid-range GPUs. This approach offers a practical path to scalable, cost-effective LLM inference by assigning the memory-bound and compute-bound phases to hardware suited to each.
Abstract
The attention layer, a core component of Transformer-based LLMs, exposes inefficiencies in current GPU systems because of its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that improves GPU resource utilization during large-batch LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. As an add-on card, the HPU also scales out to accommodate the surging memory demands driven by large batch sizes and long sequence lengths. In this paper, we present an HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
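The "low operational intensity" of attention can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper: it assumes FP16 storage (2 bytes per element), a hidden size of 4096 as in Llama 2 7B, one new token per sequence per decode step, and it counts only KV-cache reads for attention and weight plus activation traffic for a linear layer (softmax, query, and output traffic are ignored). The function names are hypothetical.

```python
def attention_decode_intensity(seq_len, hidden=4096, bytes_per_elem=2):
    """FLOPs per byte for one decode step of attention over one sequence.

    Simplified model (assumption): only K and V cache reads are counted.
    """
    # QK^T and the attention-weighted sum over V each cost
    # 2 * seq_len * hidden FLOPs (one multiply and one add per element).
    flops = 4 * seq_len * hidden
    # The K and V caches are each read once: 2 * seq_len * hidden elements.
    bytes_moved = 2 * seq_len * hidden * bytes_per_elem
    return flops / bytes_moved


def linear_decode_intensity(batch, hidden=4096, bytes_per_elem=2):
    """FLOPs per byte for a hidden x hidden linear layer during decode."""
    flops = 2 * batch * hidden * hidden
    # Weights are read once and shared by the whole batch;
    # input and output activations are batch * hidden elements each.
    bytes_moved = (hidden * hidden + 2 * batch * hidden) * bytes_per_elem
    return flops / bytes_moved


if __name__ == "__main__":
    for seq_len in (128, 2048):
        print("attention", seq_len, attention_decode_intensity(seq_len))
    for batch in (1, 64):
        print("linear", batch, round(linear_decode_intensity(batch), 2))
```

Under these assumptions, attention stays at about 1 FLOP per byte regardless of sequence length or batch size, because every sequence reads its own KV cache. The linear layer's intensity grows roughly in proportion to the batch size, since the weights are reused across the batch. This gap is the motivation for moving attention to a bandwidth-oriented device while the GPU handles the batched linear layers.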
