Towards Green AI: Decoding the Energy of LLM Inference in Software Development
Lola Solovyeva, Fernando Castor
TL;DR
This study analyzes the energy cost of LLM inference in software development by separating the inference workflow into prefill and decoding phases. Using ten decoder-only LLMs across two size groups and five workloads on code-generation and code-understanding benchmarks, it shows that decoding typically dominates energy use, and that longer prefill inputs further inflate decoding costs through larger key-value caches. The authors identify babbling as an inefficiency in some models and demonstrate a babbling-suppression approach that reduces energy by up to 89% without harming accuracy. They conclude that effective energy optimization should target both reducing unnecessary output and mitigating prefill-induced decoding costs, with practical implications for prompt length and cache management in deployment. The central metrics $E_{prefill}$, $E_{tok}$, and $E_{tok,dec}$ quantify phase-specific energy dynamics across workloads and architectures.
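To make the phase-level metrics concrete, here is a minimal sketch of how per-token decoding energy and its amplification between a short-prompt and a long-prompt run could be computed. The definitions are assumptions, since this summary does not spell them out: decoding energy per token is taken as decoding energy divided by the number of generated tokens, and amplification as the relative increase of that value. The `PhaseEnergy` type, the function names, and all numbers are hypothetical and are not results from the study.

```python
from dataclasses import dataclass


@dataclass
class PhaseEnergy:
    """Energy readings for one inference request (hypothetical values)."""
    prefill_joules: float  # energy spent processing the prompt
    decode_joules: float   # energy spent generating output tokens
    output_tokens: int     # number of generated tokens


def decode_energy_per_token(m: PhaseEnergy) -> float:
    """Assumed definition: decoding energy divided by generated tokens."""
    if m.output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return m.decode_joules / m.output_tokens


def decode_share(m: PhaseEnergy) -> float:
    """Fraction of total request energy spent in the decoding phase."""
    return m.decode_joules / (m.prefill_joules + m.decode_joules)


def amplification_percent(short: PhaseEnergy, long: PhaseEnergy) -> float:
    """Relative increase in per-token decoding energy from short to long prompt."""
    base = decode_energy_per_token(short)
    return 100.0 * (decode_energy_per_token(long) - base) / base


# Illustrative numbers only: same output length, different prompt lengths.
short_prompt = PhaseEnergy(prefill_joules=5.0, decode_joules=100.0, output_tokens=200)
long_prompt = PhaseEnergy(prefill_joules=50.0, decode_joules=120.0, output_tokens=200)

print(f"decode share (short prompt): {decode_share(short_prompt):.2f}")
print(f"amplification: {amplification_percent(short_prompt, long_prompt):.1f}%")
```

Holding the output length fixed in both runs isolates the effect of prompt length on per-token decoding cost, which is the effect the summary attributes to larger key-value caches.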
Abstract
Context: AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development. Objective: In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state. Method: We investigate six 6B-7B and four 3B-4B transformer-based models, evaluating them on two code-centric benchmarks: HumanEval for code generation and LongBench for code understanding. Results: Our findings show that, within both parameter groups, models exhibit distinct energy patterns across phases. Furthermore, we observe that increases in prefill cost amplify the energy cost per token during decoding, with amplifications ranging from 1.3% to 51.8% depending on the model. Lastly, three out of ten models demonstrate babbling behavior, adding excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings ranging from 44% to 89% without affecting generation accuracy. Conclusion: These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting the impact of prefill on decoding.
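The abstract does not describe how babbling suppression is implemented, so the following is only an illustrative sketch of one common way to cut off excess output in code generation: stop consuming the token stream as soon as the text moves past the requested function body. It assumes a HumanEval-style setup in which the prompt already contains the function signature, and it uses a hypothetical list of stop sequences. It should not be read as the authors' method.

```python
from typing import Iterable, Sequence

# Hypothetical stop markers: text that typically begins after the requested
# function body is complete (a new definition, a main guard, a stray print).
STOP_SEQUENCES: Sequence[str] = ("\ndef ", "\nclass ", "\nif __name__", "\nprint(")


def suppress_babbling(
    chunks: Iterable[str],
    stop_sequences: Sequence[str] = STOP_SEQUENCES,
) -> str:
    """Accumulate streamed text and stop at the earliest stop sequence.

    Returning early means no further chunks are requested from the stream,
    which is where the energy saving would come from in a real decoder loop.
    """
    text = ""
    for chunk in chunks:
        text += chunk
        # Search the accumulated text so that a marker split across two
        # chunks is still detected.
        hits = [i for i in (text.find(s) for s in stop_sequences) if i != -1]
        if hits:
            return text[: min(hits)]
    return text


# Simulated stream: the model finishes the body, then starts an unrelated function.
stream = iter(["    return a", " + b\n", "\ndef unrelated", "():\n    pass\n"])
completion = suppress_babbling(stream)
print(repr(completion))
```

Because the check runs on every streamed chunk, generation halts at the first sign of extra content instead of trimming the output after the full token budget has already been spent.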
