
SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning

Xueqi Yang, Mariusz Jakubowski, Li Kang, Haojie Yu, Tim Menzies

TL;DR

This paper introduces SparseCoder, an approach that combines sparse attention with learned token pruning (LTP), a method adapted from natural language processing, to address the difficulty Transformer-based approaches have with long code sequences.

Abstract

As software projects rapidly evolve, software artifacts become more complex and the defects hidden in them become harder to identify. Emerging Transformer-based approaches achieve remarkable performance but struggle with long code sequences, because their self-attention mechanism scales quadratically with the sequence length. This paper introduces SparseCoder, an approach that incorporates sparse attention and learned token pruning (LTP), a method adapted from natural language processing, to address this limitation. Compared with the previous state-of-the-art models CodeBERT, RoBERTa, and CodeT5, our experiments demonstrate that SparseCoder can handle significantly longer input sequences: at least twice as long, within the limits of our hardware resources and data statistics. Additionally, SparseCoder is four times faster than other methods in measured runtime, and it achieves a 50% reduction in floating-point operations (FLOPs) with a negligible performance drop of less than 1% compared to Transformers using sparse attention (Sparse Atten). Plotting the FLOPs of model inference against token length reveals that SparseCoder scales linearly, whereas other methods, including the current state-of-the-art model CodeT5, scale quadratically. Moreover, SparseCoder enhances interpretability by visualizing non-trivial tokens layer by layer.
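
The quadratic-versus-linear scaling claim can be illustrated with a rough operation count. The sketch below is a simplified cost model, not the paper's measurement procedure: it counts only the two matrix products inside one attention head (the query-key scores and the attention-weighted values), ignores global attention tokens and token pruning, and uses hypothetical function names and an assumed head dimension `d`.

```python
def full_attention_flops(n: int, d: int) -> int:
    """Approximate FLOPs for one head of full self-attention.

    Two products of cost 2 * n * n * d each (one multiply and one add per
    term): the n x n score matrix and the attention-weighted sum of values.
    """
    return 2 * (2 * n * n * d)


def window_attention_flops(n: int, d: int, w: int) -> int:
    """Approximate FLOPs when each token attends to at most w neighbors."""
    k = min(w, n)  # a window cannot be wider than the sequence itself
    return 2 * (2 * n * k * d)


if __name__ == "__main__":
    d, w = 64, 128  # assumed head dimension and window size, for illustration
    for n in (512, 1024, 2048, 4096):
        ratio = full_attention_flops(n, d) / window_attention_flops(n, d, w)
        print(f"n={n}: full / windowed cost ratio = {ratio:.0f}")
```

Under this model, doubling the sequence length quadruples the cost of full attention but only doubles the cost of windowed attention once the sequence is longer than the window.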

Paper Structure (38 sections, 9 figures, 5 tables)


Figures (9)

  • Figure 1: An example of SparseCoder: token pruning on accumulated attention matrices, with full attention depicted in (a) and sparse attention in (b). Attention scores are accumulated column-wise (vertically) after the row-wise softmax formulated in Transformer models. With a pre-defined threshold of 0.5, tokens marked with ✗ in both (a) and (b) are trivial tokens that are pruned away because their accumulated attention scores fall below the threshold. (a) details the token pruning process on a single self-attention layer of a Transformer model, as demonstrated by Kim et al. [kim2021learned]; (b) shows the token pruning process within the sparse attention layer of SparseCoder, which achieves greater computational efficiency through a sliding window strategy with a window size of three. Finally, (c) visualizes token pruning after multiple attention layers, demonstrating the elimination of trivial tokens.
  • Figure 2: Pipeline for using natural language models in downstream tasks. Our empirical study uses only the fine-tuning and inference stages. Fine-tuning adjusts the parameters of pre-trained NLP models on a training set for the specific downstream task, and inference evaluates the fine-tuned models on the test (new/unseen) datasets of that task.
  • Figure 3: Illustration of combining local and global attention mechanisms and of storing the attention matrix efficiently on hardware: (a) full attention, (b) local + global attention, with global attention scores marked in pink, (c) decomposition of the global attention, and (d) efficiently stored local attention.
  • Figure 4: A demonstration of the sliding window mechanism for local attention in a Transformer, where n is the token length, w is the window size, and i is the index of the i-th token in the sequence.
  • Figure 5: Overall structure of SparseCoder. The core architecture, highlighted with a red dashed box, consists of sparse attention (right module) and learned token pruning (left module). The Transformer-based baselines (RoBERTa, CodeBERT, and CodeT5) use only the self-attention mechanism, as shown in sub-figure (a) of Figure 3. In SparseCoder, the sparse attention mechanism (see sub-figures (b)-(c) of Figure 3 for details) reduces the computational overhead and extends the token length that the model can analyze. SparseCoder further incorporates learned token pruning (left module) to prune away trivial tokens and reduce the model inference cost.
  • ...and 4 more figures
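
The pruning procedure described in the Figure 1 caption (row-wise softmax, column-wise accumulation, thresholding, and a sliding window of size three) can be sketched as follows. This is a minimal single-head NumPy illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the threshold is fixed rather than learned, global attention is omitted, and whether the paper normalizes the accumulated scores is not specified in the captions above.

```python
import numpy as np


def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean n x n mask: token i may attend to token j when |i - j| <= w // 2."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2


def row_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over the positions allowed by the mask."""
    masked = np.where(mask, scores, -np.inf)
    # Subtract each row's maximum for numerical stability; the diagonal is
    # always inside the window, so every row has at least one finite entry.
    masked = masked - masked.max(axis=1, keepdims=True)
    exp = np.exp(masked)  # exp(-inf) is 0, so masked positions get no weight
    return exp / exp.sum(axis=1, keepdims=True)


def prune_tokens(scores: np.ndarray, mask: np.ndarray, threshold: float = 0.5):
    """Return (keep, accumulated): which tokens survive and their scores.

    Attention each token receives is accumulated column-wise (vertically);
    tokens whose accumulated score falls below the threshold are pruned.
    """
    attn = row_softmax(scores, mask)
    accumulated = attn.sum(axis=0)
    keep = accumulated >= threshold
    return keep, accumulated


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, w = 8, 3  # window size three, as in the Figure 1 caption
    scores = rng.normal(size=(n, n))  # stand-in for scaled query-key scores
    keep, accumulated = prune_tokens(scores, sliding_window_mask(n, w))
    print("accumulated attention:", np.round(accumulated, 2))
    print("kept token indices:", np.flatnonzero(keep))
```

Because every row of the attention matrix sums to one, the accumulated scores always sum to the sequence length, so a fixed threshold prunes the tokens that receive less than their share of attention. Passing an all-true mask instead of the sliding-window mask reproduces the full-attention case in panel (a).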