Evaluating the Energy-Efficiency of the Code Generated by LLMs
Md Arman Islam, Devi Varaprasad Jonnala, Ritika Rekhi, Pratik Pokharel, Siddharth Cilamkoti, Asif Imran, Tevfik Kosar, Bekir Turkkan
TL;DR
The paper examines the energy efficiency of AI-assisted code generation by systematically benchmarking 20 LLMs on 878 LeetCode problems against canonical human-written solutions. It introduces an evaluation framework that jointly measures energy consumption, runtime, memory usage, token costs, and generation success across two datasets, using identical prompts and fair comparison protocols. Key findings show that canonical solutions are consistently more energy-efficient and memory-efficient than LLM-generated code, with energy gaps of up to 450 times for algorithmic categories such as dynamic programming, backtracking, and bit manipulation. Among the LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code but still lag behind human-written code. The work highlights the environmental and economic implications of AI-assisted code generation and advocates sustainability-oriented benchmarks and prompting strategies to reduce emissions and resource use in practical software development. Overall, the study makes a data-driven argument for integrating green metrics into the design and evaluation of next-generation code-generation tools.
Abstract
As the quality of code generated by Large Language Models (LLMs) improves, the adoption of LLMs in the software industry for automated code generation continues to grow. Researchers primarily focus on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of the code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories selected from the LeetCode platform, comparing it against canonical human-written solutions. Although LLMs can produce functionally correct results in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient models. On average, human-written canonical solutions are approximately 1.17 times more energy efficient than code generated by DeepSeek-v3, 1.21 times more energy efficient than code generated by GPT-4o, and over 2 times more energy efficient than code generated by Grok-2 and Gemini-1.5-Pro. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-written canonical solutions.
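To make the "times more energy efficient" figures concrete, here is a minimal sketch of one way such a ratio could be computed. The abstract does not state the exact aggregation used, so the definition below (mean LLM energy divided by mean canonical energy over the same problems) is an assumption, and the function name `energy_ratio` and the sample readings are hypothetical, not taken from the paper.

```python
from statistics import mean


def energy_ratio(llm_joules, canonical_joules):
    """Return mean LLM energy divided by mean canonical energy.

    Assumed definition (not confirmed by the paper): a value of 1.17
    means the LLM's code used 17% more energy on average, i.e. the
    canonical code is 1.17x more energy efficient.
    """
    if len(llm_joules) != len(canonical_joules):
        raise ValueError("measurements must cover the same problems")
    if not llm_joules:
        raise ValueError("need at least one measurement")
    return mean(llm_joules) / mean(canonical_joules)


# Hypothetical per-problem energy readings in joules (illustrative only).
canonical = [2.0, 4.0, 6.0]
llm = [2.4, 4.4, 7.6]

ratio = energy_ratio(llm, canonical)
print(f"LLM / canonical energy ratio: {ratio:.2f}")
```

A ratio of means and a mean of per-problem ratios can differ noticeably when a few problems dominate energy use, so the choice of aggregation matters when reading headline factors like these.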
