From Prompts to Power: Measuring the Energy Footprint of LLM Inference
Francisco Caravaca, Ángel Cuevas, Rubén Cuevas
TL;DR
This work examines the energy footprint of LLM inference through a large-scale, measurement-driven study across 21 GPU configurations and 155 model architectures using the vLLM engine, quantifying energy at the per-prompt level. It builds a general predictive model that extends to unseen architectures and hardware, and implements it in a browser extension to raise awareness of environmental impact. Key findings: energy depends on output length, batch size, and hardware characteristics; model size alone is not sufficient to predict energy; and memory bottlenecks such as KV cache capacity can dominate in certain regimes. The approach offers actionable guidance for deployment optimization and a practical tool for comparing configurations and informing design choices for greener AI systems.
Abstract
The rapid expansion of Large Language Models (LLMs) has introduced unprecedented energy demands, extending beyond training to large-scale inference workloads that often dominate total lifecycle consumption. Deploying these models requires energy-intensive GPU infrastructure, and in some cases has even prompted plans to power data centers with nuclear energy. Despite this growing relevance, systematic analyses of inference energy consumption remain limited. In this work, we present a large-scale measurement-based study comprising over 32,500 measurements across 21 GPU configurations and 155 model architectures, from small open-source models to frontier systems. Using the vLLM inference engine, we quantify energy usage at the prompt level and identify how architectural and operational factors shape energy demand. Building on these insights, we develop a predictive model that accurately estimates inference energy consumption across unseen architectures and hardware, and implement it as a browser extension to raise awareness of the environmental impact of generative AI.
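To make "energy usage at the prompt level" concrete, the following minimal sketch shows one common way to turn sampled GPU power readings into a per-prompt energy figure: integrate power over time with the trapezoidal rule, then divide by the number of prompts served in that window. This is an illustrative assumption, not the authors' measurement pipeline; the function names, the sample values, and the even split across a batch are all hypothetical.

```python
from typing import Sequence


def integrate_energy_joules(timestamps_s: Sequence[float],
                            power_w: Sequence[float]) -> float:
    """Trapezoidal integration of power samples (watts) over time (seconds).

    Assumes the samples cover exactly the inference window of interest.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")

    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Area of the trapezoid between two consecutive samples.
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


def energy_per_prompt_wh(total_joules: float, num_prompts: int) -> float:
    """Split total energy evenly across prompts and convert J to Wh.

    The even split is a simplifying assumption for batched inference.
    """
    if num_prompts <= 0:
        raise ValueError("num_prompts must be positive")
    return total_joules / 3600.0 / num_prompts


# Hypothetical samples: one reading per second during a batch of 4 prompts.
timestamps = [0.0, 1.0, 2.0, 3.0]
power = [100.0, 300.0, 300.0, 100.0]

total_j = integrate_energy_joules(timestamps, power)
per_prompt_wh = energy_per_prompt_wh(total_j, num_prompts=4)
print(f"total: {total_j:.1f} J, per prompt: {per_prompt_wh:.4f} Wh")
```

An even split is the simplest attribution choice; since the study reports that energy varies with output length and batch size, a real attribution scheme would likely weight prompts by the tokens they generate rather than dividing uniformly.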
