AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code

Lola Solovyeva, Sophie Weidmann, Fernando Castor

TL;DR

This work assesses the energy efficiency and performance of code produced by large language models across Python, Java, and C++ on macOS and Ubuntu, using GitHub Copilot, GPT-4o, and OpenAI o1-mini on LeetCode problems rated hard. A human-written LeetCode Hard baseline (159 solutions across 53 problems) anchors the comparisons, while 477 LLM-generated solutions are evaluated for energy use, runtime, and correctness. Results show that Python yields the best pass@1 accuracy and often matches or outperforms the baseline in energy efficiency, whereas Java and especially C++ frequently incur higher energy costs. o1-mini delivers accuracy gains but at a higher energy expense, and energy footprints are consistent across the two platforms. The findings suggest that LLMs can produce energy-efficient code in some languages, but that caution is needed for performance-critical languages, and that the energy patterns of generated code may be platform-agnostic. Practitioners should therefore weigh energy costs alongside correctness in AI-assisted software development.
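To make the pass@1 metric mentioned above concrete, here is a minimal sketch. It assumes the simplest reading of the metric, namely one generated solution per problem, so that pass@1 is the fraction of problems whose first solution passes all tests. The function name `pass_at_1` and the sample outcomes are hypothetical and are not taken from the paper.

```python
from typing import Dict


def pass_at_1(first_attempt_passed: Dict[str, bool]) -> float:
    """Fraction of problems whose first generated solution passed all tests.

    Assumption: exactly one generated solution is recorded per problem.
    """
    if not first_attempt_passed:
        raise ValueError("no problems supplied")
    return sum(first_attempt_passed.values()) / len(first_attempt_passed)


# Hypothetical outcomes for illustration only, not data from the paper.
results = {
    "problem_a": True,
    "problem_b": False,
    "problem_c": True,
    "problem_d": True,
}
print(pass_at_1(results))  # 3 of 4 problems passed -> 0.75
```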

Abstract

Large language models (LLMs) are used in software development to assist in various tasks, e.g., code generation and code completion, but empirical evaluations of the quality of the results produced by these models focus on correctness and ignore other relevant aspects, such as their performance and energy efficiency. Studying the performance of LLM-produced programs is essential to understand how well LLMs can support the construction of performance- and energy-critical software, such as operating systems, servers, and mobile applications. This paper presents the first study analyzing the energy efficiency and performance of LLM-generated code for three programming languages (Python, Java, and C++) on two platforms (a Mac and a PC), leveraging three frontier LLMs (GitHub Copilot, GPT-4o, and the recently released OpenAI o1-mini) and targeting "hard" programming problems from LeetCode. Our results show that the models are much more successful in generating Python and Java than C++ code.

Paper Structure

This paper contains 16 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Each subplot illustrates the average energy consumption required to complete programming problems within each category (x-axis) for a specific programming language (Python, Java, or C++). The x-axis categories are arranged in alphabetical order. Results are displayed only for those programming problems for which all models produced working solutions. The left y-axis represents the energy consumption (in Joules) on Ubuntu, while the right y-axis represents the energy consumption (in Joules) on macOS. The y-axis scales differ across the three languages. The legend applies to all subplots and describes the data points for both Ubuntu and macOS.
  • Figure 2: Illustration of the Spearman correlation between the energy consumption results for solutions generated by each model across two platforms.
  • Figure 3: Example solutions for the First Missing Positive problem. The left solution was generated by OpenAI's o1-mini model, while the right solution represents the highest-rated human implementation from LeetCode.
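
The Spearman correlation referred to in the Figure 2 caption measures how well the ranking of solutions by energy consumption on one platform matches the ranking on the other. The sketch below is one standard way to compute it (Pearson correlation applied to average ranks) and is not necessarily the procedure the authors used. The helper names and the energy values are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical per-solution energy readings in Joules, not data from the paper.
ubuntu_joules = [1.0, 2.0, 3.0, 4.0]
macos_joules = [0.4, 0.9, 1.1, 2.5]
rho = spearman(ubuntu_joules, macos_joules)
print(f"Spearman rho = {rho:.3f}")
```

Because the coefficient depends only on ranks, it is unaffected by the two platforms reporting energy on different absolute scales, which fits a comparison between Ubuntu and macOS measurements.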