CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Kounianhua Du, Jizheng Chen, Renting Rui, Huacan Chai, Lingyue Fu, Wei Xia, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang

TL;DR

The paper tackles the vocabulary mismatch and syntactic gap between natural language and programming languages that hinder NL-to-code generation. It introduces CodeGRAG, a framework that constructs a composed syntax graph from AST, data-flow, and control-flow with read-write signals, and uses retrieval-augmented generation to inform LLMs. Two prompting strategies are proposed: a hard meta-graph prompt for tuning-free models and a soft prompting path that injects GraphEmb into model parameters via a GNN expert with alignment and structure-preserving objectives. Empirically, CodeGRAG yields consistent gains over baselines and demonstrates cross-language improvements, highlighting the practical potential of graphical programming knowledge in enhancing code generation.

Abstract

Using large language models (LLMs) to generate code has shown promise for transforming software development. Despite the capabilities of LLMs, their code generation still has room to improve because of the syntactic gap and mismatched vocabulary between natural language (NL) and programming languages (PL). In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks from their control flow and data flow to better capture programming domain knowledge; this view helps natural-language-based LLMs understand code syntax and serves as a bridge among different programming languages. To bring the extracted structural knowledge into foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging syntax graph into an informative graphical view for tuning-free models and 2) a soft prompting technique that injects programming-language domain knowledge into model parameters by finetuning the models with soft signals encoded by a GNN expert model. Specifically, two constraints are designed to improve alignment and structural expressiveness, contributing to the informativeness of the single-token-sized external <GraphEmb> for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/ .
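To make the idea of a composed graph more concrete, here is a minimal sketch in Python, not the authors' implementation. It uses the standard `ast` module to combine three edge types on one node set: AST parent-to-child edges, a crude control-flow approximation (consecutive statements in a body), and data-flow edges from the most recent write of a variable to each later read. The function name `build_composed_graph`, the edge labels, and the "reads before writes on the same line" heuristic are all assumptions made for illustration; the paper's actual extraction may differ.

```python
import ast


def build_composed_graph(source: str):
    """Return (nodes, edges) for a toy composed graph of `source`.

    nodes: list of AST node type names, indexed by node id.
    edges: list of (src_id, dst_id, edge_type) triples.
    """
    tree = ast.parse(source)
    nodes, edges = [], []
    node_ids = {}   # id(ast node) -> node id, for statements and names
    name_uses = []  # ast.Name nodes in traversal order

    def visit(node, parent_id):
        node_id = len(nodes)
        nodes.append(type(node).__name__)
        node_ids[id(node)] = node_id
        if parent_id is not None:
            edges.append((parent_id, node_id, "ast_child"))
        if isinstance(node, ast.Name):
            name_uses.append(node)
        for child in ast.iter_child_nodes(node):
            # Load/Store context markers carry no structure; skip them.
            if not isinstance(child, ast.expr_context):
                visit(child, node_id)

    visit(tree, None)

    # Control flow (assumed approximation): link consecutive statements.
    for node in ast.walk(tree):
        body = getattr(node, "body", None)
        if isinstance(body, list):
            for prev, nxt in zip(body, body[1:]):
                edges.append((node_ids[id(prev)], node_ids[id(nxt)], "control_flow"))

    # Data flow (assumed heuristic): on each line, process reads before
    # writes, and link the latest write of a variable to each read of it.
    def order(n):
        return (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset)

    last_write = {}
    for name in sorted(name_uses, key=order):
        if isinstance(name.ctx, ast.Store):
            last_write[name.id] = node_ids[id(name)]
        elif name.id in last_write:
            edges.append((last_write[name.id], node_ids[id(name)], "data_flow"))

    return nodes, edges
```

For a snippet such as `a = 1`, `b = a + 2`, `a = b * a` on three lines, the sketch links the first write of `a` to both later reads of `a`, and the write of `b` to its read on the third line. Branches, loops, function arguments, and attribute accesses are deliberately left out to keep the example short.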

Paper Structure (18 sections, 2 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Illustration of the gap between the programming language and the natural language.
  • Figure 2: Overview of CodeGRAG. (Top) Knowledge Preparation. We extract composed syntax graphs of external code blocks by composing the control flow and data flow of the code using the read-write signal, preserving the innate semantic and logical information. The composed graphs are then abstracted into graphical views as the hard knowledge document and embedded into <GraphEmb>s as the soft knowledge document. The <GraphEmb> is encoded by a pretrained GNN expert model constrained by the alignment and structure-preserving objectives. (Bottom) Retrieval Augmented Generation. We extract a query from the task input and retrieve from the external corpus. For tuning-free models, we use the hard graphical view to stimulate the structural programming knowledge of LLMs for enhanced generation. For tunable models, we use the soft <GraphEmb> and inject the programming domain knowledge into LLM parameters by finetuning them with the GNN expert signals. The LLMs informed by the expert signals can then produce enhanced generation.
  • Figure 3: Illustration of the extracted composed syntax graph from the code block. The arrows in the bottom part indicate the names of different edges, which are extracted based on the ASTs.
  • Figure 4: Prompt templates.
  • Figure 5: t-SNE visualization of soft signals trained with different objectives.
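
The retrieval-augmented step for tuning-free models, as described in the Figure 2 caption, can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the paper's pipeline: the corpus entry fields (`description`, `graph_view`), the bag-of-words cosine retriever, and the prompt wording in `build_hard_prompt` are all hypothetical. The paper's actual retriever and prompt templates (Figure 4) are not reproduced here.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a stand-in for a real retriever)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list) -> dict:
    """Return the corpus entry whose description best matches the query."""
    q = bow(query)
    return max(corpus, key=lambda entry: cosine(q, bow(entry["description"])))


def build_hard_prompt(task: str, entry: dict) -> str:
    """Assemble a prompt that places the retrieved graphical view before the task.

    The wording is hypothetical, chosen only to show the shape of the idea.
    """
    return (
        "Here is the graphical view of a related code block:\n"
        f"{entry['graph_view']}\n\n"
        "Using the structure above as a hint, complete the following task:\n"
        f"{task}\n"
    )


# Toy external corpus; the graph views are hand-written placeholders.
corpus = [
    {
        "description": "compute the sum of a list of numbers",
        "graph_view": "For -> AugAssign(total); data_flow: total -> Return",
    },
    {
        "description": "reverse a string",
        "graph_view": "Subscript(Slice step=-1) -> Return",
    },
]

task = "sum the elements of a list"
best = retrieve(task, corpus)
prompt = build_hard_prompt(task, best)
```

Here the query overlaps most with the first description, so its graph view is placed in the prompt. The soft-prompting path, which replaces the textual view with a learned <GraphEmb> vector from a GNN, is not covered by this sketch.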