TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, Yong Chen
TL;DR
The paper addresses the lack of practical, reproducible benchmarks for measuring power consumption during LLM inference, which dominates energy use in production. It introduces TokenPowerBench, a lightweight, extensible framework with declarative configuration, multi-level telemetry, and a phase-aware metrics pipeline aligned to prefill and decode stages. The authors demonstrate broad model coverage (1B–405B, dense and MoE), multi-node readiness, and detailed parameter sweeps (batch size, context length, parallelism, quantization) to quantify joules per token and related metrics. Open-sourced and scalable, TokenPowerBench enables energy-aware deployment decisions and sustainability analyses for large-scale LLM services.
Abstract
Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or inference performance and provide little support for measuring and analyzing the power consumption of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straightforward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy, and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover models from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.
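To make the idea of phase-aligned energy attribution concrete, the following is a minimal sketch of how joules per token could be computed from timestamped power samples. It is not TokenPowerBench's actual pipeline, which the abstract does not detail. The names `PhaseWindow`, `power_at`, `energy_joules`, and `joules_per_token` are hypothetical, and the sketch assumes power samples in watts, sorted by timestamp in seconds, with energy obtained by trapezoidal integration over each phase window.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in watts)


@dataclass
class PhaseWindow:
    """Hypothetical record of one inference phase (e.g. "prefill" or "decode")."""
    name: str
    start_s: float
    end_s: float
    tokens: int  # tokens processed (prefill) or generated (decode) in this window


def power_at(samples: List[Sample], t: float) -> float:
    """Linearly interpolate power at time t; clamp outside the sampled range."""
    if t <= samples[0][0]:
        return samples[0][1]
    if t >= samples[-1][0]:
        return samples[-1][1]
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return p1
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    return samples[-1][1]  # not reached when samples are sorted by time


def energy_joules(samples: List[Sample], start_s: float, end_s: float) -> float:
    """Trapezoidal integral of power over [start_s, end_s], in joules."""
    points = [(start_s, power_at(samples, start_s))]
    points += [(t, p) for t, p in samples if start_s < t < end_s]
    points.append((end_s, power_at(samples, end_s)))
    return sum(0.5 * (p0 + p1) * (t1 - t0)
               for (t0, p0), (t1, p1) in zip(points, points[1:]))


def joules_per_token(samples: List[Sample],
                     phases: List[PhaseWindow]) -> Dict[str, Tuple[float, float]]:
    """Map each phase name to (energy in joules, joules per token)."""
    result = {}
    for phase in phases:
        energy = energy_joules(samples, phase.start_s, phase.end_s)
        result[phase.name] = (energy, energy / phase.tokens)
    return result


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    samples = [(0.0, 100.0), (1.0, 300.0), (2.0, 300.0), (3.0, 200.0)]
    phases = [PhaseWindow("prefill", 0.0, 1.0, tokens=100),
              PhaseWindow("decode", 1.0, 3.0, tokens=50)]
    for name, (energy, jpt) in joules_per_token(samples, phases).items():
        print(f"{name}: {energy:.1f} J, {jpt:.2f} J/token")
```

With the illustrative samples above, the prefill window integrates to 200 J (2 J per token) and the decode window to 550 J (11 J per token). Splitting energy at the prefill/decode boundary in this way is what allows the two phases to be compared separately when batch size, context length, or other settings change.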
