An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2

Pepijn de Reus, Ana Oprescu, Jelle Zuidema

TL;DR

This study examines quantisation and pruning strategies for reducing the energy consumption of code Large Language Model (LLM) inference using StarCoder2, and suggests future work on hardware-optimised quantisation to improve efficiency with minimal loss in accuracy.

Abstract

This study examines quantisation and pruning strategies for reducing the energy consumption of code Large Language Model (LLM) inference. Using StarCoder2, we observe that quantisation increases energy demand because of lower throughput, and also causes some accuracy loss. Conversely, pruning reduces energy usage but impairs performance. The results highlight the challenges and trade-offs of LLM compression. We suggest future work on hardware-optimised quantisation to improve efficiency with minimal loss in accuracy.

Paper Structure

This paper contains 29 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Graph depicting the logarithmic growth of LLMs since GPT-1 in 2018 until 2021. Obtained from han_pre-trained_2021.
  • Figure 2: Overview of our evaluation framework. Prompts are fed into the code LLM and the output is pre-processed before being evaluated. The top-$k$ outputs are evaluated and, if at least one implementation passes, the pass@$k$ score is 1.
  • Figure 3: Average energy consumption of the StarCoder2-3B model on five runs, predicting 128 new tokens. The bars indicate the original model and the quantised versions of 4-bit and 8-bit. The whiskers display the 95% confidence interval (1.96 $\cdot$ std. dev.).
  • Figure 4: Average energy consumption of the StarCoder2-3B model on five runs, predicting 256 new tokens. The bars indicate the original model and the quantised versions of 4-bit and 8-bit. The whiskers display the 95% confidence interval (1.96 $\cdot$ std. dev.).
  • Figure 5: Average energy consumption of the StarCoder2-7B model on five runs, predicting 128 new tokens. The bars indicate the original model and the quantised versions of 4-bit and 8-bit. The whiskers display the 95% confidence interval (1.96 $\cdot$ std. dev.).
  • ...and 3 more figures
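
The captions above describe two small computations: the per-prompt pass@$k$ score (Figure 2) and the whiskers drawn at 1.96 $\cdot$ std. dev. (Figures 3 to 5). The following is a minimal sketch of both, not the authors' code. The function names and the example numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the captions do not say which estimator was used.

```python
from statistics import mean, stdev


def pass_at_k(results: list[bool], k: int) -> int:
    """Return 1 if at least one of the first k candidate outputs passed, else 0.

    `results` holds one boolean per generated implementation for a single
    prompt, ordered by rank (hypothetical input format).
    """
    return int(any(results[:k]))


def whisker_half_width(measurements: list[float]) -> float:
    """Half-width of the whisker as the captions define it: 1.96 * std. dev.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return 1.96 * stdev(measurements)


if __name__ == "__main__":
    # Hypothetical test outcomes for the top-3 outputs of one prompt.
    outcomes = [False, True, False]
    print("pass@1:", pass_at_k(outcomes, 1))
    print("pass@2:", pass_at_k(outcomes, 2))

    # Hypothetical energy readings from five runs (arbitrary units).
    energy = [1.0, 2.0, 3.0, 4.0, 5.0]
    print("mean:", mean(energy), "whisker: +/-", whisker_half_width(energy))
```

Averaging `pass_at_k` over all prompts in a benchmark would give a benchmark-level score under this definition.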