
SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization

Yanlin Wang, Yanxian Huang, Daya Guo, Hongyu Zhang, Zibin Zheng

TL;DR

This work tackles file-level code summarization, a long-input challenge for standard Transformers. It introduces SparseCoder, a three-pattern identifier-aware sparse Transformer with local, global, and identifier attention, plus LoRA for parameter efficiency, evaluated on the new FILE-CS dataset. Results show state-of-the-art performance and favorable memory characteristics, with ablations confirming the value of global and identifier attention and the efficiency gain from LoRA. The model also demonstrates generality to code clone detection and code search across datasets, suggesting practical impact for large-scale code understanding tasks.

Abstract

Code summarization aims to generate natural language descriptions of source code, helping programmers understand and maintain it quickly. While previous code summarization efforts have predominantly focused on the method level, this paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects. Unlike method-level code summarization, file-level code summarization typically involves long source code within a single file. This makes it challenging for Transformer-based models to understand the code semantics: because computational complexity scales quadratically with the input sequence length, the maximum input length of these models cannot easily be set large enough to handle long code input well. To address this challenge, we propose SparseCoder, an identifier-aware sparse Transformer for effectively handling long code sequences. Specifically, SparseCoder employs a sliding window mechanism for self-attention to model short-term dependencies, and it leverages the structural information of code to capture long-term dependencies among source code identifiers by introducing two types of sparse attention patterns, named global attention and identifier attention. To evaluate the performance of SparseCoder, we construct a new dataset, FILE-CS, for file-level code summarization in Python. Experimental results show that SparseCoder achieves state-of-the-art performance compared with other pre-trained models, including both full self-attention and sparse models. Additionally, our model has low memory overhead and achieves performance comparable to models that use the full self-attention mechanism.
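The three attention patterns named in the abstract can be pictured as a boolean mask over query/key positions. The following is a minimal sketch, not the authors' implementation: the function names, the symmetric window, and the rules used here (global tokens attend to and are attended by every position; identifier tokens attend to one another) are assumptions made for illustration, since the abstract does not define the patterns precisely.

```python
def build_sparse_mask(seq_len, window, global_positions=(), identifier_positions=()):
    """Return a seq_len x seq_len boolean matrix.

    mask[i][j] is True when query position i may attend to key position j.
    All three rules below are illustrative assumptions, not the paper's
    exact definitions.
    """
    half = window // 2

    # Local (sliding window) attention: each token sees nearby tokens only.
    mask = [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]

    # Global attention (assumed): a global token attends to every position,
    # and every position attends to it.
    for g in global_positions:
        for k in range(seq_len):
            mask[g][k] = True
            mask[k][g] = True

    # Identifier attention (assumed): identifier tokens attend to each other
    # regardless of how far apart they are in the file.
    ids = list(identifier_positions)
    for i in ids:
        for j in ids:
            mask[i][j] = True

    return mask


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Hypothetical toy input: 16 tokens, one global token, three identifier tokens.
mask = build_sparse_mask(
    seq_len=16, window=4, global_positions=[0], identifier_positions=[3, 9, 14]
)
full_pairs = 16 * 16
print(f"sparse pairs: {count_allowed(mask)} of {full_pairs} under full attention")
```

With a fixed window and a small number of global and identifier tokens, the number of permitted pairs grows roughly linearly with sequence length rather than quadratically, which is the property the abstract relies on for long files.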

Paper Structure (35 sections, 13 equations, 5 figures, 6 tables)

This paper contains 35 sections, 13 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The overview of SparseCoder. Global and local identifiers are marked in yellow and red, respectively.
  • Figure 2: A code example of the attention matrix map.
  • Figure 3: The effect of input sequence length on memory usage and the average score of the three metrics.
  • Figure 4: An example from the FILE-CS dataset and the predictions of different models. The input code file is displayed on the left, and the model predictions are shown on the right. Arrows indicate the key call relationships.
  • Figure 5: Another example from the FILE-CS dataset and the predictions of different models. The input code file is displayed on the left, and the model predictions are shown on the right. Arrows indicate the key call relationships.