Tackling Long Code Search with Splitting, Encoding, and Aggregating

Fan Hu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Xirong Li

TL;DR

Long code snippets are a bottleneck for Transformer-based code search: the quadratic cost of self-attention forces truncation, which harms retrieval accuracy. The authors introduce SEA (Split, Encode and Aggregate), a baseline that handles long code without re-pretraining by partitioning code into blocks, encoding each block with a pretrained encoder, and aggregating the block embeddings via attention-based or pooling mechanisms. A sliding window reduces the computational cost to approximately $O(n^2/k)$, enabling scalable long-code representations. The best SEA configuration uses AST-based splitting with a 32-token window, a 16-token step, and one-layer attention with mean pooling. It outperforms three sparse Transformers and traditional baselines across CodeSearchNet's six languages, achieving an overall MRR of 0.785 with GraphCodeBERT as the encoder. The approach is encoder-agnostic, integrates a batch-acceleration mechanism (combine-divide), and shows robust improvements on long code, making it a practical baseline for long-code search and a step toward better long-code understanding in code retrieval tasks.
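The split and aggregate steps described above can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: blocks are cut over a flat token list rather than over the AST, the one-layer attention is omitted so that only mean pooling remains, and `encode_block` is a hypothetical stand-in for a pretrained encoder such as GraphCodeBERT. The function names are illustrative only.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def sliding_window_split(tokens: Sequence[str], window: int = 32, step: int = 16) -> List[List[str]]:
    """Cut a token sequence into overlapping blocks of at most `window` tokens."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not tokens:
        return []
    blocks: List[List[str]] = []
    start = 0
    while True:
        blocks.append(list(tokens[start:start + window]))
        # Stop once the current block reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += step
    return blocks


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sea_embed(
    tokens: Sequence[str],
    encode_block: Callable[[Sequence[str]], Vector],  # hypothetical encoder stand-in
    window: int = 32,
    step: int = 16,
) -> Vector:
    """Split, encode each block, then aggregate the block embeddings by mean pooling."""
    blocks = sliding_window_split(tokens, window, step)
    return mean_pool([encode_block(block) for block in blocks])
```

With an 80-token input and the default window and step, this sketch produces four blocks starting at offsets 0, 16, 32, and 48, so no block exceeds the encoder's input limit and no token is dropped.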

Abstract

Code search with natural language helps us reuse existing code snippets. Thanks to Transformer-based pretrained models, the performance of code search has improved significantly. However, due to the quadratic complexity of multi-head self-attention, there is a limit on the input token length. For efficient training on standard GPUs like the V100, existing pretrained code models, including GraphCodeBERT, CodeBERT, and RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of long code that exceeds 256 tokens. To tackle the long code problem, we propose a new baseline, SEA (Split, Encode and Aggregate), which splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation. With SEA, we can directly use Transformer-based pretrained models to model long code without changing their internal structure or re-pretraining them. We also compare SEA with sparse Transformer methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean reciprocal ranking score of 0.785, which is 10.1% higher than GraphCodeBERT on the CodeSearchNet benchmark, justifying SEA as a strong baseline for long code search. Our source code and experimental data are available at: https://github.com/fly-dragon211/SEA.
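For readers unfamiliar with the metric reported above, the following is a small sketch of how a mean reciprocal rank is conventionally computed. It assumes each query has exactly one ground-truth code snippet and that its 1-based position in the ranked results is already known. The function name and the sample ranks are illustrative and are not taken from the paper's evaluation code.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over queries, where rank is the 1-based position
    of the ground-truth code snippet in each query's result list."""
    if not ranks:
        raise ValueError("need at least one query")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Illustrative ranks for three queries: the correct snippet is found
# at positions 1, 2, and 4 respectively.
example_score = mean_reciprocal_rank([1, 2, 4])
```

A score of 1.0 means the correct snippet is always ranked first, so higher is better.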

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An example case for GraphCodeBERT, which truncates input beyond the first 256 tokens. Key tokens are highlighted in yellow.
  • Figure 2: The pipeline of our proposed SEA (split, encode and aggregate) architecture.
  • Figure 3: The attention-based aggregation methods.
  • Figure 4: The batch processing combine-divide method. ① and ② refer to combination and division methods.
  • Figure 5: Performance comparison between GraphCodeBERT and SEA across different ground-truth code token lengths. Compared to GraphCodeBERT, SEA achieves significantly ($p < 0.01$) better performance across the different code token lengths.
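The combine-divide idea named in the Figure 4 caption can be sketched as follows. This is only one plausible reading of the caption, not the authors' implementation: it assumes that "combination" means flattening the blocks of several code snippets into a single batch for one encoder pass, and that "division" means regrouping the resulting block embeddings by snippet using the recorded block counts. The names `combine` and `divide` are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

Block = TypeVar("Block")
Emb = TypeVar("Emb")


def combine(blocks_per_snippet: Sequence[Sequence[Block]]) -> Tuple[List[Block], List[int]]:
    """Flatten per-snippet block lists into one batch and record each snippet's block count."""
    flat = [block for blocks in blocks_per_snippet for block in blocks]
    counts = [len(blocks) for blocks in blocks_per_snippet]
    return flat, counts


def divide(flat_embeddings: Sequence[Emb], counts: Sequence[int]) -> List[List[Emb]]:
    """Regroup a flat list of block embeddings back into per-snippet lists."""
    if sum(counts) != len(flat_embeddings):
        raise ValueError("counts do not match the number of embeddings")
    grouped: List[List[Emb]] = []
    start = 0
    for n in counts:
        grouped.append(list(flat_embeddings[start:start + n]))
        start += n
    return grouped
```

Under these assumptions, snippets with different numbers of blocks can share one encoder call, and each snippet's embeddings are then aggregated separately after the division step.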