Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient Pruning

Laura Puccioni, Alireza Farshin, Mariano Scazzariello, Changjie Wang, Marco Chiesa, Dejan Kostic

TL;DR

This work tackles the resource demands of code-capable LLMs by deriving domain-specific sub-models through unstructured pruning. It extends Wanda with domain- and language-specific calibration data to produce programming-language sub-models for Python, Java, C++, and JavaScript across multiple base models, and also covers math, commonsense reasoning (CSR), and translation tasks. Evaluations using pass@k on MBPP, METEOR for translation, and GSM8K (8-shot) show that pruning with domain-matched calibration data yields higher in-domain accuracy and that language-specific sub-models preserve acceptable performance relative to the full models. Structural analyses reveal distinct weight masks across domains, supporting the claim that domain-specific tasks activate different regions within LLMs and pointing toward more efficient local execution of code models on consumer hardware.
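The pass@k metric mentioned above can be illustrated with a short sketch. This summary does not state which estimator the authors used, so the code below assumes the commonly used unbiased estimator, in which n samples are generated per problem and c of them pass the tests. The function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: number of samples the metric is allowed to draw
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # 1 - P(all k drawn samples fail), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is the mean of the per-problem estimates, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)`.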

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance in various complex code generation tasks. However, their broader adoption is limited by significant computational demands and high resource requirements, particularly memory and processing power. To mitigate such requirements, model pruning techniques are used to create more compact models with significantly fewer parameters. However, current approaches do not focus on the efficient extraction of programming-language-specific sub-models. In this work, we explore the idea of efficiently deriving coding-specific sub-models through unstructured pruning (i.e., Wanda). We investigate the impact of different domain-specific calibration datasets on pruning outcomes across three distinct domains and extend our analysis to extracting four language-specific sub-models: Python, Java, C++, and JavaScript. We are the first to efficiently extract programming-language-specific sub-models using appropriate calibration datasets while maintaining acceptable accuracy w.r.t. full models. We are also the first to provide analytical evidence that domain-specific tasks activate distinct regions within LLMs, supporting the creation of specialized sub-models through unstructured pruning. We believe that this work has significant potential to enhance LLM accessibility for coding by reducing computational requirements, enabling local execution on consumer-grade hardware, and supporting faster inference times critical for real-time development feedback.
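To make the role of the calibration dataset concrete, the following is a minimal NumPy sketch of Wanda-style scoring for a single linear layer. It assumes the standard Wanda criterion (weight magnitude multiplied by the L2 norm of the corresponding input activation, with pruning applied per output row) and is not the authors' implementation. The function name `wanda_mask`, the toy shapes, and the synthetic "domain" activations are all hypothetical.

```python
import numpy as np


def wanda_mask(weight, calib_acts, sparsity=0.5):
    """Return a boolean keep-mask for one linear layer.

    weight:     (out_features, in_features) weight matrix
    calib_acts: (num_tokens, in_features) inputs to this layer, collected
                by running the calibration dataset through the model
    sparsity:   fraction of weights to remove in each output row
    """
    # L2 norm of each input feature over all calibration tokens.
    act_norm = np.linalg.norm(calib_acts, axis=0)
    # Importance score of weight (i, j): |W_ij| * ||X_j||_2.
    score = np.abs(weight) * act_norm[None, :]
    # Drop the lowest-scoring weights within each output row.
    n_prune = int(weight.shape[1] * sparsity)
    prune_idx = np.argsort(score, axis=1)[:, :n_prune]
    mask = np.ones_like(weight, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return mask


# Toy illustration: the same weights, two synthetic calibration sets whose
# activations emphasize different input features (stand-ins for two domains).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
acts_a = rng.normal(size=(32, 8)) * np.array([5, 5, 5, 5, 1, 1, 1, 1])
acts_b = rng.normal(size=(32, 8)) * np.array([1, 1, 1, 1, 5, 5, 5, 5])
mask_a = wanda_mask(W, acts_a)
mask_b = wanda_mask(W, acts_b)
pruned_a = W * mask_a  # sub-model weights for "domain a"
```

Because the activation norms enter the score, changing only the calibration data can change which weights survive, which is the mechanism the abstract relies on when it speaks of domain- and language-specific sub-models.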

Paper Structure

This paper contains 7 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Our methodology for extracting domain-specific LLMs.
  • Figure 2: Comparison of code-task accuracy for non-coding sub-models and of math-task accuracy for non-math sub-models.
  • Figure 3: Left and center: the q_proj weight matrices of layer 12 for two sub-models obtained by pruning Gemma on two different datasets. Right: the weights that differ between the two matrices (red dots).
  • Figure 4: Jaccard distance between Gemma sub-models pruned on (a) math_qa (math) and math_orca (math), and (b) math_qa (math) and opus_books (translation).
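
The mask comparisons described in Figures 3 and 4 can be sketched as follows. This is an illustrative sketch only: it assumes each sub-model is represented by a boolean mask of retained weights and that the Jaccard distance is computed over those retained positions, which this summary does not specify. The names `jaccard_distance` and `differing_weights` are hypothetical.

```python
import numpy as np


def jaccard_distance(mask_a, mask_b):
    """Jaccard distance between two keep-masks of the same shape.

    0.0 means both masks retain exactly the same weights; 1.0 means they
    share no retained weight.
    """
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Both masks are empty; treat them as identical.
        return 0.0
    intersection = np.logical_and(a, b).sum()
    return 1.0 - intersection / union


def differing_weights(mask_a, mask_b):
    """Positions kept in exactly one of the two masks (cf. the red dots)."""
    return np.logical_xor(np.asarray(mask_a, dtype=bool),
                          np.asarray(mask_b, dtype=bool))


# Toy masks for one row of a weight matrix.
same_domain = jaccard_distance([1, 1, 0, 0], [1, 1, 0, 0])
cross_domain = jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0])
```

Under these assumptions, a larger distance for a cross-domain pair than for a same-domain pair would be consistent with the claim that different domains retain different regions of the model.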