Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy

Negar Alizadeh, Boris Belchev, Nishant Saurabh, Patricia Kelbert, Fernando Castor

TL;DR

Employing a larger LLM with a higher energy budget does not always yield significantly better accuracy, and quantized versions of large models generally offer better efficiency and accuracy than full-precision versions of medium-sized ones.

Abstract

The use of generative AI-based coding assistants like ChatGPT and GitHub Copilot is a reality in contemporary software development. Many of these tools are provided as remote APIs. Using third-party APIs raises data privacy and security concerns for client companies, which motivates the use of locally deployed language models. In this study, we explore the trade-off between model accuracy and energy consumption, aiming to help developers make informed decisions when selecting a language model. We investigate the performance of 18 families of LLMs in typical software development tasks on two real-world infrastructures: a commodity GPU and a powerful AI-specific GPU. Because deploying LLMs locally requires powerful infrastructure that might not be affordable for everyone, we consider both full-precision and quantized models. Our findings reveal that employing a large LLM with a higher energy budget does not always translate to significantly improved accuracy. Additionally, quantized versions of large models generally offer better efficiency and accuracy than full-precision versions of medium-sized ones. Finally, no single model is suitable for all types of software development tasks.

Paper Structure

This paper contains 15 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Schematic representation of the study. The circles represent the models of the NVIDIA GPUs available on the two machines where the experiments were run. The experiments involved quantized and full-precision models. We measured the energy footprint of these models when performing four different tasks, using the workload provided by the HumanEvalPack benchmark.
  • Figure 2: Energy (Wh) consumed by the A100 GPU when performing each task. Model names are presented according to the pattern family-QQ, where family includes the family name (phi3, gemma, llama3, etc.) and the number of parameters in billions (20b, 3.8b, etc.), and QQ is the quantization level (4-bit, 8-bit, or full-precision 16-bit).
  • Figure 3: Energy (Wh) consumed by the RTX 3070 GPU and the Intel Core i7 CPU when performing each task. The lower part of each bar represents GPU energy, and the upper part represents CPU energy. Model names follow the pattern of Figure 2.
  • Figure 5: Spearman's correlation matrix for all models across all tasks on the A100 GPU ($p$-value $< 0.0016$).
  • Figure : (a) Code Generation
  • ...and 4 more figures
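
Figure 5 reports a Spearman correlation matrix. As a rough illustration of what one entry of such a matrix measures, the following minimal Python sketch computes Spearman's rank correlation between two equally long lists, using average ranks for ties. The function names `average_ranks` and `spearman` are hypothetical, and the numbers in the usage lines are made-up placeholders, not measurements from the paper; the authors' actual analysis pipeline is not described in this summary.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Placeholder values for illustration only (not results from the paper).
energy_wh = [1.0, 2.0, 3.0, 4.0]
accuracy = [0.10, 0.20, 0.30, 0.40]
rho = spearman(energy_wh, accuracy)
```

Because the statistic depends only on ranks, it captures any monotonic relationship between two variables, not just a linear one, which suits comparisons across models of very different sizes.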