Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency
Mohammadjavad Mehditabar, Saurabhsingh Rajput, Antonio Mastropaolo, Tushar Sharma
TL;DR
BRACE is a framework for benchmarking Code Language Models (CLMs) jointly on energy efficiency and functional accuracy. It proposes two rating methods, CIRC and OTER, that produce interpretable 1–5 scores, and validates them on 22 CLMs across code generation and code summarization tasks using LiveCodeBench and CodeXGLUE. The study finds that models generally rate higher on code summarization than on code generation, that model size alone does not predict ratings, and that CIRC provides deterministic comparisons while OTER captures more nuanced energy–accuracy trade-offs. In practice, BRACE supports evidence-based, deployment-aware model selection and offers a foundation for extending energy-efficiency analysis to broader software-engineering contexts.
Abstract
The rapid advancement of AI technologies and their accelerated adoption in software development necessitate a systematic evaluation of their environmental impact alongside functional correctness. While prior studies have examined sustainability in large language models, existing approaches lack systematic frameworks for evaluating accuracy-energy trade-offs in Code Language Models (CLMs). In this paper, we present BRACE, a framework to benchmark CLMs on a unified scale of energy efficiency and functional correctness (referred to as accuracy). We benchmark 22 state-of-the-art models on code generation and summarization tasks and propose two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). CIRC provides deterministic, Euclidean-distance-based rankings with static trade-offs that are robust to outliers, whereas OTER offers trend-aware evaluation with dynamic trade-offs that capture the complex correlation between energy and accuracy. These methods let us rate LLMs on a 1-5 scale reflecting their combined energy efficiency and functional correctness. Our analysis reveals that models generally perform better on code summarization, as they are not required to produce grammar-constrained, syntactically correct output. We also find that model size does not significantly affect ratings, indicating that models that use their parameters efficiently can rank higher on these scales. The BRACE framework enables practitioners to make evidence-based model selections that balance sustainability with task requirements, and it guides the choice of rating method (CIRC for deterministic comparisons, OTER for trend-aware evaluation) based on deployment priorities.
