Quantizing Large Language Models for Code Generation: A Differentiated Replication

Alessandro Giagnorio, Antonio Mastropaolo, Saima Afrin, Massimiliano Di Penta, Gabriele Bavota

TL;DR

The paper investigates how Additive Quantization of Language Models (AQLM) can compress large code-focused LLMs (CodeLlama and DeepSeek-Coder, up to 34B parameters) for code generation without substantial performance loss. Evaluating 8-, 4-, 3-, and 2-bit quantization on Python and Java tasks with multiple calibration datasets, the authors find that 4-bit quantization yields about a 70% memory reduction with minimal impact on pass@1, while 3- and 2-bit quantization depend on code-specific calibration data and larger models to limit the loss. The study extends prior work with larger models, more programming languages, and state-of-the-art extreme quantization techniques, and shows that extreme quantization becomes more viable as model size grows. The work provides empirical guidance for deploying efficient, locally runnable code LLMs and releases replication artifacts to support further research on efficient AI pipelines for software engineering.
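The TL;DR reports the impact of quantization in terms of pass@1. As a rough illustration of what that metric measures, the sketch below implements the unbiased pass@k estimator that is widely used for code-generation benchmarks. Whether the paper computes pass@1 exactly this way is an assumption, and the function name `pass_at_k` and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples for the problem
    c: number of those samples that pass all tests
    k: number of samples the user is allowed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 samples, 3 of which pass the tests.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.2f}")
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n. A benchmark score is then the mean of this value over all problems.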

Abstract

Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. A previous work by Wei et al. proposed to leverage quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing that this has a limited impact on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider (i) more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing the compression to the extreme quantization level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without observing any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
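To put the bit widths in the abstract into perspective, the following back-of-the-envelope sketch estimates how much storage the weights alone need at each precision. It assumes a 16-bit baseline and counts only parameters times bits; both are assumptions made for illustration, and the helper names are hypothetical. Measured footprints, such as the 70% average reduction reported for 4 bits, can differ from this idealized count, for example when some components of a model are stored at higher precision.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Idealized weight storage in GiB: parameters * bits / 8 bytes."""
    return n_params * bits_per_param / 8 / 2**30


def reduction_vs_baseline(bits: float, baseline_bits: float = 16.0) -> float:
    """Fraction of memory saved relative to the assumed baseline precision."""
    return 1.0 - bits / baseline_bits


if __name__ == "__main__":
    n_params = 34e9  # largest model size mentioned in the abstract
    for bits in (16, 8, 4, 3, 2):
        size = weight_memory_gib(n_params, bits)
        saved = reduction_vs_baseline(bits)
        print(f"{bits:>2}-bit: {size:6.2f} GiB, {saved:.1%} saved vs. 16-bit")
```

Under these assumptions, 4 bits gives an upper bound of 75% savings, which is consistent with the somewhat lower average the abstract reports.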

Paper Structure

This paper contains 19 sections, 1 equation, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Accuracy loss and memory saving of the quantized models compared to the baseline model for the Python (left) and Java (right) benchmarks. The dotted red lines highlight the models' performance after the 4-bit quantization.