GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer

Yufan Ye, Pu Pang, Ting Zhang, Hua Huang

TL;DR

This paper addresses the semantic gap between natural language queries and code in retrieval by introducing GNN-Coder, a framework that combines AST-guided Graph Neural Networks with Transformer encoders. It proposes ASTGPool, a graph pooling method tailored to ASTs, and the Mean Angular Margin (MAM) metric to quantify embedding separability, all trained with a CLIP-like contrastive objective. Empirical results on CSN and CosQA show consistent MRR improvements across multiple Transformer backbones and notable zero-shot gains, while MAM analyses indicate more discriminative and uniform code embeddings. Overall, GNN-Coder demonstrates that leveraging structural AST information through a tailored GNN enhances semantic code retrieval and provides a principled way to measure embedding quality.
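The "CLIP-like contrastive objective" mentioned above can be illustrated with a symmetric cross-entropy loss over a batch of paired text and code embeddings. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(text_embs, code_embs, temperature=0.07):
    """Symmetric contrastive loss; temperature is an assumed hyperparameter."""
    text = [l2_normalize(t) for t in text_embs]
    code = [l2_normalize(c) for c in code_embs]
    # logits[i][j]: scaled cosine similarity between query i and code snippet j
    logits = [[sum(a * b for a, b in zip(t, c)) / temperature for c in code]
              for t in text]
    logits_t = [list(col) for col in zip(*logits)]  # code-to-text direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: pair i of text_embs matches pair i of code_embs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each query sits closest to its own code snippet and large when the pairs are swapped, which is the behavior such an objective rewards during training.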

Abstract

Code retrieval is a crucial component of modern software development, particularly in large-scale projects. However, existing approaches that rely on sequence-based models often fail to fully exploit the structural dependencies inherent in code, leading to suboptimal retrieval performance, especially on structurally complex code fragments. In this paper, we introduce GNN-Coder, a novel framework that uses a Graph Neural Network (GNN) to exploit the Abstract Syntax Tree (AST). We make the first attempt to study how a GNN-integrated Transformer can advance semantic retrieval tasks by capturing both the structural and the semantic features of code. We further propose an innovative graph pooling method tailored to ASTs, which uses the number of child nodes as a key feature to highlight the intrinsic topological relationships within the AST. This design integrates sequential and hierarchical representations, enhancing the model's ability to capture code structure and semantics. Additionally, we introduce the Mean Angular Margin (MAM), a novel metric for quantifying the uniformity of code embedding distributions, providing a standardized measure of feature separability. The proposed method achieves a lower MAM, indicating a more discriminative feature representation. This underscores GNN-Coder's superior ability to distinguish between code snippets, thereby enhancing retrieval accuracy. Experimental results show that GNN-Coder significantly boosts retrieval performance, with a 1%-10% improvement in MRR on the CSN dataset and a notable 20% gain in zero-shot performance on the CosQA dataset.
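The retrieval results above are reported in MRR (mean reciprocal rank). As a reminder of how that metric is computed, here is a minimal sketch; the function name and the convention that the correct snippet for query i sits at index i are assumptions for illustration, and the scores are made-up values, not results from the paper.

```python
def mean_reciprocal_rank(score_matrix):
    """score_matrix[i][j] is the similarity of query i to candidate j.

    Assumes the correct candidate for query i is at index i.
    """
    total = 0.0
    for i, scores in enumerate(score_matrix):
        # Rank = 1 + number of candidates scored strictly higher than the correct one.
        rank = 1 + sum(1 for s in scores if s > scores[i])
        total += 1.0 / rank
    return total / len(score_matrix)

# Illustrative similarity scores for three queries over three candidates.
scores = [
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.8, 0.5, 0.1],  # correct candidate ranked 2nd
    [0.3, 0.6, 0.2],  # correct candidate ranked 3rd
]
mrr = mean_reciprocal_rank(scores)  # (1 + 1/2 + 1/3) / 3
```

A higher MRR means the correct snippet tends to appear nearer the top of the ranked list, so the metric rewards exactly the kind of separation between candidates that the abstract attributes to GNN-Coder.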

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Average MAM across the six programming languages (PLs) in the CSN dataset. A value close to 0 indicates thorough feature separation.
  • Figure 2: Overall architecture of GNN-Coder. The code is transformed into an AST, which is initialized with a Transformer model, processed by a GNN, and aligned with text embeddings through a contrastive loss function.
  • Figure 3: Illustrating importance score calculation for different pooling methods. "deg" represents in-degree.
  • Figure 4: Illustrating the GNN model, a hierarchical architecture that incorporates ASTGPool layers. Here we show a hierarchical depth of 3; $F_1, F_2, F_3$ denote the features extracted at each corresponding depth.
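
The abstract describes a pooling method that uses the number of child nodes as a key feature, and Figure 3 concerns how importance scores are computed. The sketch below illustrates only the general idea of ranking AST nodes by child count and keeping the top fraction. It is not ASTGPool: the actual scoring function is not given in this summary, and the function names, the use of Python's `ast` module, and the 0.5 keep ratio are all assumptions for illustration.

```python
import ast
import math

def child_counts(source):
    """Return (node_type, number_of_children) for every AST node, in walk order."""
    tree = ast.parse(source)
    return [(type(node).__name__, len(list(ast.iter_child_nodes(node))))
            for node in ast.walk(tree)]

def top_k_by_children(nodes, ratio=0.5):
    """Indices of the ceil(ratio * n) nodes with the most children.

    Ties keep their original order because Python's sort is stable.
    The returned indices are sorted back into walk order.
    """
    k = max(1, math.ceil(ratio * len(nodes)))
    ranked = sorted(range(len(nodes)), key=lambda i: -nodes[i][1])
    return sorted(ranked[:k])

source = "def add(a, b):\n    return a + b\n"
nodes = child_counts(source)
kept = top_k_by_children(nodes, ratio=0.5)
kept_types = [nodes[i][0] for i in kept]
```

Ranking by child count favors internal nodes such as function definitions and expressions over leaves, so a pooled graph of this kind retains the nodes that carry most of the tree's branching structure. A learned pooling layer would typically combine such a structural signal with node features rather than use the count alone.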