ALPINE: An adaptive language-agnostic pruning method for language models for code

Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, Tushar Sharma

TL;DR

ALPINE tackles the resource-intensity of language models for code by introducing an adaptive token-pruning technique that is language-agnostic and plug-and-play for Transformer encoders. It computes per-token importance from attention probabilities and prunes tokens outside a dynamic range, reducing input length and FLOPs while preserving performance. Across two SE tasks and three models, ALPINE achieves substantial reductions in FLOPs, memory footprint, and CO2 emissions with minimal accuracy loss, demonstrating practical gains for deploying code-aware LMs on consumer-grade hardware. This work highlights redundancy in source-code corpora and paves the way for more accessible, sustainable software engineering with transformer-based models.
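
The pruning step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes that a token's importance is the attention it receives, averaged over heads and query positions, and that the dynamic range is the mean of the scores plus or minus `alpha` standard deviations. The function names, the `alpha` parameter, and the choice to always keep special tokens are hypothetical and introduced only for illustration.

```python
from statistics import mean, pstdev


def token_importance(attn):
    """Score each token by the attention it receives.

    attn[h][i][j] is the attention probability from query i to key j in head h.
    Assumption: importance of token j = average of attn[h][i][j] over h and i.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    scores = []
    for j in range(seq_len):
        received = sum(attn[h][i][j] for h in range(num_heads) for i in range(seq_len))
        scores.append(received / (num_heads * seq_len))
    return scores


def prune_tokens(tokens, scores, alpha=1.0, special=("[CLS]", "[SEP]")):
    """Return the indices of tokens that survive pruning.

    Assumption: the dynamic range is [mu - alpha * sigma, mu + alpha * sigma],
    recomputed from the scores of the current input. Tokens whose score falls
    outside the range are pruned; special tokens are always kept (hypothetical).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    low, high = mu - alpha * sigma, mu + alpha * sigma
    return [
        i
        for i, (tok, score) in enumerate(zip(tokens, scores))
        if tok in special or low <= score <= high
    ]


if __name__ == "__main__":
    tokens = ["[CLS]", "int", "x", "=", "0", ";", "[SEP]"]
    scores = [0.30, 0.10, 0.11, 0.09, 0.10, 0.60, 0.10]  # made-up scores
    kept = prune_tokens(tokens, scores)
    print([tokens[i] for i in kept])
```

Because the range is recomputed from each input's own scores, the number of pruned tokens varies from one sequence and one layer to the next, which is what makes the scheme adaptive rather than a fixed top-k cut.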

Abstract

Language models of code have demonstrated state-of-the-art performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequent environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive, programming-language-agnostic pruning technique designed to substantially reduce these models' computational overhead. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, shrinking to as little as one third of their initial size (up to $3\times$ smaller), which significantly reduces the computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This leads to a reduction in CO2 emissions of up to 44.85%. Importantly, it achieves this reduction in computational resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE to make language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of adopting language models in software development. The substantial sequence compression achieved by ALPINE also sheds light on the redundant and noisy information present in source code analysis corpora.
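
To give a feel for why shorter sequences reduce FLOPs, the following back-of-the-envelope sketch estimates the cost of a Transformer encoder as a function of the sequence length seen by each layer. It counts only the matrix multiplications (one multiply-add counted as 2 FLOPs) and ignores softmax, layer normalization, biases, embeddings, and the task head. The default sizes (hidden size 768, feed-forward size 3072, 12 layers) are those of a RoBERTa-base-sized encoder and are assumptions here, and the per-layer length schedule is hypothetical, not taken from the paper's measurements.

```python
def encoder_layer_flops(seq_len, hidden=768, ffn=3072):
    """Approximate matmul FLOPs of one encoder layer for a single sequence."""
    # Q, K, V and output projections: four (seq_len x hidden) @ (hidden x hidden)
    projections = 4 * 2 * seq_len * hidden * hidden
    # QK^T and the attention-weighted sum over V, summed across all heads
    attention = 2 * 2 * seq_len * seq_len * hidden
    # Two feed-forward matmuls: hidden -> ffn and ffn -> hidden
    feed_forward = 2 * 2 * seq_len * hidden * ffn
    return projections + attention + feed_forward


def model_flops(seq_len_per_layer, hidden=768, ffn=3072):
    """Sum layer costs, where each layer may see a different sequence length."""
    return sum(encoder_layer_flops(s, hidden, ffn) for s in seq_len_per_layer)


if __name__ == "__main__":
    num_layers = 12
    baseline = model_flops([512] * num_layers)
    # Hypothetical schedule: length shrinks linearly from 512 to about a third.
    schedule = [512 - round(i * (512 - 171) / (num_layers - 1)) for i in range(num_layers)]
    pruned = model_flops(schedule)
    print(f"baseline: {baseline / 1e9:.1f} GFLOPs")
    print(f"pruned:   {pruned / 1e9:.1f} GFLOPs")
    print(f"relative reduction: {1 - pruned / baseline:.1%}")
```

The projection and feed-forward terms grow linearly with the sequence length while the attention term grows quadratically, so removing tokens early in the stack saves work in every subsequent layer.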

Paper Structure

This paper contains 27 sections, 1 theorem, 19 equations, 9 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

For every conceivable input sequence fed into a Transformer model with the same hyperparameters as CodeBERT, GraphCodeBERT, and UniXCoder, it holds true that

Figures (9)

  • Figure 1: Empirical FLOPs count, measured in GFLOPs ($10^{9}$ FLOPs), of CodeBERT on the test set of the Devign dataset after one forward pass.
  • Figure 2: Overview of a Transformer encoder-based model using ALPINE. Tokens highlighted in yellow represent special tokens such as [CLS] and [SEP]. Where applicable, we include the tensor dimensions for clarity. In the figure, $B$ is the batch size, $S$ is the sequence length, and $dim$ is the hidden dimension of the model. The number of pruned tokens is $k$, which differs from one layer to another.
  • Figure 3: The progressive average reduction in sequences' lengths as the input traverses through the layers of each model. The plots are the result of a forward pass across the whole evaluation set of each dataset with a batch size of 8.
  • Figure 4: Comparison of the GPU memory footprint between pruned and non-pruned models across the tasks. The measurements were conducted on the evaluation sets of each dataset during inference.
  • Figure 5: Fine-tuning time before and after pruning on an NVIDIA A100 GPU.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Lemma 1