Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?

Qingyuan Liang, Zhao Zhang, Zeyu Sun, Zheng Lin, Qi Luo, Yueyi Xiao, Yizhou Chen, Yuqun Zhang, Haotian Zhang, Lu Zhang, Bin Chen, Yingfei Xiong

TL;DR

This work investigates whether grammar-based code representations remain beneficial for billion-scale LLMs, where syntax errors are rare. It introduces GrammarCoder, a decoder-only transformer augmented with a grammar-rule vocabulary and trained through continued pre-training (CPT) and instruction tuning on Python data. On HumanEval(+) and MBPP(+), the grammar-based representation outperforms token-based CPT baselines. Further analysis indicates that the grammar-based representation amplifies subtle code differences, which helps the model avoid semantic errors caused by minor variations. The findings suggest that structural grammar information improves semantic differentiation and code generation even at the billion-parameter scale, and the authors release GrammarCoder resources publicly for further research.

Abstract

Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval(+) and MBPP(+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.

Paper Structure

This paper contains 32 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An example of a grammar representation. The top-left part presents a programming problem along with its corresponding Python solution. The right part illustrates the abstract syntax tree (AST) representation of the Python code. The bottom-left section presents the grammar-based representation.
  • Figure 2: An example showing how the code representations differ between erroneous and correct code.
  • Figure 3: Edit distance distribution across different code representation approaches.
  • Figure 4: DeepSeek-Coder-1.3B-Base (CPT)'s generated output for Task 38 in the HumanEval dataset (left) and the required AST modifications to correct the code (right).
  • Figure 5: DeepSeek-Coder-1.3B-Base (CPT)'s generated output for Task 147 in the HumanEval dataset (left) and the required AST modifications to correct the code (right). For clarity, we label computational units that are identical before and after modification as A, B, and C.
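
The idea behind Figures 1 to 3 can be illustrated with a minimal sketch. It is not the GrammarCoder rule set: it assumes CPython's `ast` node types as a stand-in for grammar rules, and the helper names `token_sequence`, `rule_sequence`, and `edit_distance` are hypothetical. The sketch linearizes a snippet as plain tokens and as a pre-order sequence of "parent -> children" rules, then compares how far apart a correct and an erroneous snippet are under each view.

```python
import ast
import io
import tokenize


def token_sequence(source: str) -> list:
    """Lexical tokens of `source`, ignoring layout-only tokens (newlines, indents)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.string.strip()]


def rule_sequence(source: str) -> list:
    """Pre-order sequence of 'Parent -> children' strings over the Python AST.

    Assumption: AST node types stand in for grammar rules; the paper's actual
    rule vocabulary is not reproduced here.
    """
    rules = []

    def visit(node: ast.AST) -> None:
        # Skip Load/Store/Del markers; they carry no structure of interest here.
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]
        name = type(node).__name__
        if isinstance(node, ast.Name):
            rules.append(f"{name} -> {node.id}")          # identifier leaf
        elif isinstance(node, ast.Constant):
            rules.append(f"{name} -> {node.value!r}")     # literal leaf
        elif children:
            rhs = " ".join(type(c).__name__ for c in children)
            rules.append(f"{name} -> {rhs}")
        else:
            rules.append(name)                            # e.g. an operator node
        for child in children:
            visit(child)

    visit(ast.parse(source))
    return rules


def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


correct = "x = a <= b\n"
wrong = "x = a < b\n"

token_dist = edit_distance(token_sequence(correct), token_sequence(wrong))
rule_dist = edit_distance(rule_sequence(correct), rule_sequence(wrong))
print(token_dist)  # 1: only the operator token differs
print(rule_dist)   # 2: the Compare rule and the operator leaf both differ
```

In this toy case a one-token slip changes two entries of the rule sequence, because the operator appears both in its parent's rule and as its own leaf. This only shows the mechanism; the distribution reported in Figure 3 comes from the paper's own representations and data.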