From Prompts to Power: Measuring the Energy Footprint of LLM Inference

Francisco Caravaca, Ángel Cuevas, Rubén Cuevas

TL;DR

This work tackles the energy footprint of LLM inference by conducting a large-scale, measurement-driven study across 21 GPU configurations and 155 architectures using the vLLM engine, enabling per-prompt energy quantification. It builds a generalist predictive model that extends to unseen architectures and implements it in a browser extension to raise awareness of environmental impact. Key findings show that energy scales with output length, batch size, and hardware specifics, while model size alone is not sufficient to predict energy; memory bottlenecks like KV cache capacity can dominate in certain regimes. The proposed approach provides actionable guidance for deployment optimization and sustainability, offering a practical tool to compare configurations and inform design choices for greener AI systems.

Abstract

The rapid expansion of Large Language Models (LLMs) has introduced unprecedented energy demands, extending beyond training to large-scale inference workloads that often dominate total lifecycle consumption. Deploying these models requires energy-intensive GPU infrastructure, and in some cases has even prompted plans to power data centers with nuclear energy. Despite this growing relevance, systematic analyses of inference energy consumption remain limited. In this work, we present a large-scale measurement-based study comprising over 32,500 measurements across 21 GPU configurations and 155 model architectures, from small open-source models to frontier systems. Using the vLLM inference engine, we quantify energy usage at the prompt level and identify how architectural and operational factors shape energy demand. Building on these insights, we develop a predictive model that accurately estimates inference energy consumption across unseen architectures and hardware, and implement it as a browser extension to raise awareness of the environmental impact of generative AI.

Paper Structure

This paper contains 33 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: The effect of increasing the number of input and output tokens in different models, using an NVIDIA A100 80GB as the accelerator.
  • Figure 2:
  • Figure 3: Energy consumption by number of prompts. The plot shows how GPU energy per prompt decreases with increasing batch size. Each prompt contains 300 $T_{input}$ and 300 $T_{out}$.
  • Figure 4: Model parameters against energy consumed. The plot shows the energy consumed per prompt by different LLMs, for prompts with 500 input and output tokens.
  • Figure 5: Energy consumption by number of layers in the Gemma 7B model. We selected two models and modified the number of layers; these experiments were run on 2 NVIDIA A100 80GB GPUs with the models in BF16.
  • ...and 9 more figures