Towards Green AI: Decoding the Energy of LLM Inference in Software Development

Lola Solovyeva, Fernando Castor

TL;DR

This study analyzes the energy cost of LLM inference in software development by separating the inference workflow into prefill and decoding phases. Using ten decoder-only LLMs across two size groups and five workloads on code-generation and code-understanding benchmarks, it shows that decoding typically dominates energy use, but prefill length also inflates decoding costs via larger key-value caches. The authors identify babbling as an inefficiency in some models and demonstrate a babbling-suppression approach that reduces energy by up to 89% without harming accuracy. They conclude that effective energy optimization should target both reducing unnecessary output and mitigating prefill-induced decoding costs, with practical implications for prompt length and cache management in deployment. $E_{prefill}$, $E_{tok}$, and $E_{tok,dec}$ are central metrics used to quantify phase-specific energy dynamics across workloads and architectures.

Abstract

Context: AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development. Objective: In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state. Method: We investigate six 6B-7B and four 3B-4B transformer-based models, evaluating them on two code-centric benchmarks: HumanEval for code generation and LongBench for code understanding. Results: Our findings show that, within both parameter groups, models exhibit distinct energy patterns across phases. Furthermore, we observe that increases in prefill cost amplify the energy cost per token during decoding, with amplifications ranging from 1.3% to 51.8% depending on the model. Lastly, three out of ten models demonstrate babbling behavior, adding excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings ranging from 44% to 89% without affecting generation accuracy. Conclusion: These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting the impact of prefill on decoding.
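To make the idea of babbling suppression concrete, the sketch below stops consuming a token stream as soon as the output moves past the requested function body. It is an illustration only and not necessarily the mechanism the authors implemented: the stop sequences (of the kind commonly used for HumanEval-style completions), the function name `generate_with_suppression`, and the token stream are all assumptions made for this example.

```python
# Hypothetical stop sequences: text that typically signals the model has
# finished the requested function and started producing extra content.
STOP_SEQUENCES = ("\ndef ", "\nclass ", "\nif __name__", "\nprint(")


def generate_with_suppression(token_stream, stop_sequences=STOP_SEQUENCES):
    """Consume tokens until a stop sequence appears.

    token_stream is any iterable of decoded token strings (for example a
    generator wrapping a model's streaming output). Returns the truncated
    text and the number of tokens actually consumed.
    """
    text = ""
    consumed = 0
    for tok in token_stream:
        text += tok
        consumed += 1
        # Earliest position at which any stop sequence occurs, or -1 if none.
        hits = [i for i in (text.find(s) for s in stop_sequences) if i != -1]
        cut = min(hits, default=-1)
        if cut != -1:
            # Drop the stop sequence and everything after it, then stop early.
            return text[:cut], consumed
    return text, consumed


# Simulated stream: the model finishes the body, then starts a second function.
tokens = ["    return", " a + b", "\n", "\n", "def", " helper", "():", "\n    pass"]
completion, used = generate_with_suppression(iter(tokens))
print(repr(completion), used, "of", len(tokens), "tokens consumed")
```

Because the stream is consumed lazily, returning early means the remaining tokens are never requested, which is where any decoding-energy saving would come from in a real deployment.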

Paper Structure (15 sections, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Simplified methodology proposed by Babakol et al., aligning token generation and energy measurements based on timestamps. Each color represents a different token.
  • Figure 2: Energy consumption per token during token generation for CodeLlama-7B for 0-shot CoT. The x-axis shows token index and y-axis energy consumption for that token.
  • Figure 3: The plots illustrate the relationship between input size and energy consumption during the prefill phase (left) and per token in the decoding phase (right).
  • Figure 4: The plots illustrate token generation during inference, with the x-axis representing the index of each generated token and the y-axis representing the energy consumption for that token. The examples correspond to code understanding with long output.
  • Figure 5: The plots illustrate token generation during inference, with the x-axis representing the index of each generated token and the y-axis representing the energy consumption for that token. The examples shown correspond to 0-shot prompting with chain-of-thought.
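The timestamp-based alignment described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' or Babakol et al.'s implementation: power is assumed to be sampled at known timestamps and held constant until the next sample, the first output token's timestamp is taken as the end of prefill, and the names `integrate_power` and `phase_energy` are hypothetical.

```python
def integrate_power(times, watts, a, b):
    """Energy in joules over [a, b].

    Assumes each power sample (watts[i] at times[i]) holds until the next
    sample. Time before the first sample contributes nothing.
    """
    energy = 0.0
    for i, seg_start in enumerate(times):
        seg_end = times[i + 1] if i + 1 < len(times) else float("inf")
        lo = max(a, seg_start)
        hi = min(b, seg_end)
        if hi > lo:
            energy += watts[i] * (hi - lo)
    return energy


def phase_energy(times, watts, t_start, token_times):
    """Split measured energy into prefill and per-token decoding energy.

    t_start:     timestamp at which inference began.
    token_times: timestamp at which each output token was emitted; the first
                 entry is treated as the end of prefill (an assumption).
    Returns (prefill energy, list of per-token decoding energies,
             mean decoding energy per token).
    """
    e_prefill = integrate_power(times, watts, t_start, token_times[0])
    per_token = [
        integrate_power(times, watts, token_times[i - 1], token_times[i])
        for i in range(1, len(token_times))
    ]
    mean_dec = sum(per_token) / len(per_token) if per_token else 0.0
    return e_prefill, per_token, mean_dec


# Toy data: power sampled once per second, three output tokens.
times = [0.0, 1.0, 2.0, 3.0]          # seconds
watts = [100.0, 200.0, 200.0, 100.0]  # watts
e_prefill, per_token, mean_dec = phase_energy(times, watts, 0.0, [1.5, 2.0, 3.0])
print(e_prefill, per_token, mean_dec)  # 200.0 [100.0, 200.0] 150.0
```

Plotting `per_token` against the token index would give a curve of the kind shown in Figures 2, 4, and 5, while the prefill energy and the mean decoding energy per token play the role of phase-level summaries.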