Bridging Code Graphs and Large Language Models for Better Code Understanding

Zeqi Chen, Zhaoyang Chu, Yi Gui, Feng Guo, Yao Wan, Chuan Shi

TL;DR

This work addresses the mismatch between linear token processing in LLMs and the intrinsic structural semantics of code. It introduces CGBridge, a plug-and-play framework that learns code structure via a Code Graph Encoder and aligns code, graph, and text through a Bridge Module to generate structure-informed prompts for a frozen LLM. The approach yields clear gains in code summarization and translation, improves robustness to variable renaming, and delivers substantial inference speedups over LoRA-based methods. These results demonstrate that modular, structure-aware prompting can enhance code understanding while remaining efficient for large-scale models.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance in code intelligence tasks such as code generation, summarization, and translation. However, their reliance on linearized token sequences limits their ability to understand the structural semantics of programs. While prior studies have explored graph-augmented prompting and structure-aware pretraining, they either suffer from prompt length constraints or require task-specific architectural changes that are incompatible with large-scale instruction-following LLMs. To address these limitations, this paper proposes CGBridge, a novel plug-and-play method that enhances LLMs with Code Graph information through an external, trainable Bridge module. CGBridge first pre-trains a code graph encoder via self-supervised learning on a large-scale dataset of 270K code graphs to learn structural code semantics. It then trains an external module to bridge the modality gap among code, graph, and text by aligning their semantics through cross-modal attention mechanisms. Finally, the bridge module generates structure-informed prompts, which are injected into a frozen LLM, and is fine-tuned for downstream code intelligence tasks. Experiments show that CGBridge achieves notable improvements over both the original model and the graph-augmented prompting method. Specifically, it yields a 16.19% and 9.12% relative gain in LLM-as-a-Judge on code summarization, and a 9.84% and 38.87% relative gain in Execution Accuracy on code translation. Moreover, CGBridge achieves over 4x faster inference than LoRA-tuned models, demonstrating both effectiveness and efficiency in structure-aware code understanding.
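
The abstract describes a bridge module that uses cross-modal attention to turn graph-encoder outputs into structure-informed prompts for a frozen LLM. The sketch below is one possible reading of that idea, not the paper's implementation: a small set of learnable query vectors cross-attends over graph node embeddings, and the result is projected to the LLM's embedding width and prepended to the token embeddings. The function name `bridge_soft_prompt`, the single attention head, all dimensions, and the random weights are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bridge_soft_prompt(node_emb, queries, w_q, w_k, w_v, w_out):
    """Hypothetical bridge step: queries cross-attend over graph nodes.

    node_emb: (num_nodes, d_graph)   outputs of a code graph encoder
    queries:  (num_queries, d_model) learnable query vectors
    returns:  soft prompt (num_queries, d_llm) and attention weights
    """
    q = queries @ w_q                       # (num_queries, d_attn)
    k = node_emb @ w_k                      # (num_nodes, d_attn)
    v = node_emb @ w_v                      # (num_nodes, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1]) # scaled dot-product scores
    attn = softmax(scores, axis=-1)         # each query's weights over nodes
    soft_prompt = (attn @ v) @ w_out        # project to LLM embedding width
    return soft_prompt, attn

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
num_nodes, num_queries = 12, 4
d_graph, d_model, d_attn, d_llm = 16, 16, 8, 32

node_emb = rng.normal(size=(num_nodes, d_graph))
queries = rng.normal(size=(num_queries, d_model))
w_q = rng.normal(size=(d_model, d_attn))
w_k = rng.normal(size=(d_graph, d_attn))
w_v = rng.normal(size=(d_graph, d_attn))
w_out = rng.normal(size=(d_attn, d_llm))

soft_prompt, attn = bridge_soft_prompt(node_emb, queries, w_q, w_k, w_v, w_out)

# Prepend the soft prompt to (stand-in) token embeddings of the frozen LLM.
token_emb = rng.normal(size=(20, d_llm))
llm_input = np.concatenate([soft_prompt, token_emb], axis=0)
```

Under this reading, the prompt length is fixed by the number of queries rather than by the size of the code graph, which is one way a design like this could avoid the prompt length constraints the abstract attributes to graph-augmented prompting.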

Paper Structure

This paper contains 38 sections, 14 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: A motivating example.
  • Figure 2: A Python function (left) and its corresponding Code Property Graph (CPG, right).
  • Figure 3: Overview of the CGBridge framework.
  • Figure 4: Ablation experiments of training components.
  • Figure 5: Performance across different code length bins.
  • ...and 4 more figures