Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Yu Wang, Guohao Dai
TL;DR
The paper analyzes how hardware characteristics shape the acceleration of generative LLM inference, proposing a unified framework to compare software optimizations (quantization, sparsity, fast decoding, operator optimization, heterogeneous/homogeneous cooperation) across CPU, GPU, FPGA, ASIC, and PIM/NDP platforms. It adopts token-based metrics, namely inference speed (tokens/s) and energy efficiency (tokens/J), and provides qualitative and quantitative assessments of performance at batch sizes 1 and 8, highlighting the relative strengths of each platform and optimization. Key contributions include a comprehensive cross-platform comparison, a synthesis of architecture families (Attention, SSM, Hybrid), and insights on three edge AI trends: multimodal LLMs, inference-time compute, and higher inference energy efficiency. The work emphasizes hardware-software co-design as essential for meeting edge deployment demands and points toward higher throughput and lower energy consumption for future edge AI systems.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs such as BERT and DeBERTa, generative LLMs such as the GPT series and the Llama series are currently the main focus due to their superior algorithmic performance. Advances in generative LLMs are closely intertwined with the development of hardware capabilities. Different hardware platforms exhibit distinct characteristics that can be exploited to improve LLM inference performance. This paper therefore comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we give an overview of the algorithm architectures of mainstream generative LLMs and examine the inference process. Then, we summarize the optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and report inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance at batch sizes 1 and 8 on different hardware platforms, considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the same optimization method across different hardware platforms, different hardware platforms against one another, and different optimization methods on the same hardware platform. This yields a systematic and comprehensive summary of existing inference acceleration work that integrates software optimization methods with hardware platforms. We point out three trends (multimodality, inference-time compute, and higher inference energy efficiency) that are promising to redefine the capabilities of edge artificial intelligence systems. Our project is available at https://dai.sjtu.edu.cn/project.html.
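The two token-based metrics named above are related through power: since one joule is one watt-second, tokens/J equals tokens/s divided by the average power in watts. The following is a minimal sketch of that arithmetic, not the paper's measurement methodology. The `InferenceRun` record, its field names, and the example numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRun:
    """One measured generation run (hypothetical record, not from the paper)."""
    generated_tokens: int    # tokens produced, summed over the whole batch
    elapsed_seconds: float   # wall-clock time of the run
    avg_power_watts: float   # average hardware power draw during the run


def tokens_per_second(run: InferenceRun) -> float:
    """Absolute inference speed in tokens/s."""
    if run.elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return run.generated_tokens / run.elapsed_seconds


def tokens_per_joule(run: InferenceRun) -> float:
    """Energy efficiency in tokens/J, using energy = power * time."""
    energy_joules = run.avg_power_watts * run.elapsed_seconds
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return run.generated_tokens / energy_joules


# Made-up numbers: 1000 tokens generated in 20 s at an average of 250 W.
example = InferenceRun(generated_tokens=1000, elapsed_seconds=20.0,
                       avg_power_watts=250.0)
speed = tokens_per_second(example)        # 50 tokens/s
efficiency = tokens_per_joule(example)    # 0.2 tokens/J, i.e. speed / 250 W
```

Under this definition, a low-power platform can rank ahead of a faster one on tokens/J even when it trails on tokens/s, which is why the two metrics are reported side by side.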
