CodePAD: Sequence-based Code Generation with Pushdown Automaton

Yihong Dong, Xue Jiang, Yuchen Liu, Ge Li, Zhi Jin

TL;DR

CodePAD introduces a pushdown automaton (PDA) module to enforce grammar during sequence-based code generation, addressing the lack of grammatical guarantees in popular models. By deriving grammar-consistent next-token sets and integrating PDA state information into a Transformer-based decoder, CodePAD achieves 100% grammatical correctness on benchmark Python datasets and yields substantial improvements over non-pretrained and some pretrained baselines. The approach includes token- and state-prediction tasks with a joint prediction mechanism, and ablation studies confirm the critical role of the PDA component. The method shows strong zero-shot gains for pretrained models and has potential to generalize to other context-free languages beyond Python, albeit with language-specific PDAs and some computational overhead.

Abstract

In code generation, it is essential to guarantee that the generated code satisfies the grammar constraints of the programming language (PL). However, neglecting grammar constraints is a fatal drawback of commonly used sequence-based code generation. In this paper, we devise a pushdown automaton (PDA)-based methodology to address this problem, exploiting the principle that a PL is a subset of the PDA-recognizable languages and that code accepted by the PDA is grammatical. Specifically, we construct a PDA module and design an algorithm that constrains the generation of sequence-based models to ensure grammatical correctness. Guided by this methodology, we further propose CodePAD, a sequence-based code generation framework equipped with a PDA module, to integrate the deduction of the PDA into deep learning. Additionally, this framework can leverage the states of PDA deduction (including state representation, a state prediction task, and joint prediction with state) to assist models in learning PDA deduction. To comprehensively evaluate CodePAD, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CodePAD can leverage existing sequence-based models, and we show that it achieves 100% grammatical correctness on these benchmark datasets. As a result, it yields relative improvements of 17% CodeBLEU on CONALA, 8% EM on DJANGO, and 15% CodeBLEU on JUICE-10K over the base models. In addition, our method significantly enhances pre-trained models; for example, the CodeBLEU of CodeGen-350M improves from 3.21 to 21.54 on MBPP in the zero-shot setting.
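
To make the idea of PDA-constrained generation concrete, here is a minimal sketch under strong simplifying assumptions. It is not the paper's Python PDA or its algorithm: the grammar is a toy language of balanced brackets around `x` tokens, and every name (`BracketPDA`, `toy_scores`, `constrained_decode`, `accepts`) is hypothetical. The stand-in "model" deliberately prefers an ungrammatical sequence, and at each step the PDA restricts the candidate set to the grammar-consistent next tokens.

```python
import math
from typing import Callable, Dict, List, Set

VOCAB = ["(", ")", "[", "]", "x", "<eos>"]
CLOSER = {"(": ")", "[": "]"}  # opener -> matching closer


class BracketPDA:
    """Toy PDA for sequences of 'x' with balanced () and [] (an assumption,
    far simpler than a PDA for Python)."""

    def __init__(self) -> None:
        self.stack: List[str] = []  # closers that are still owed

    def valid_next(self) -> Set[str]:
        allowed = {"(", "[", "x"}
        if self.stack:
            allowed.add(self.stack[-1])  # only the matching closer is legal
        else:
            allowed.add("<eos>")  # may stop only when nothing is left open
        return allowed

    def step(self, token: str) -> None:
        if token not in self.valid_next():
            raise ValueError(f"ungrammatical token: {token!r}")
        if token in CLOSER:
            self.stack.append(CLOSER[token])
        elif token in CLOSER.values():
            self.stack.pop()


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a sequence model. Left unconstrained, greedy
    decoding would emit '( [ x ) ] )', which has mismatched brackets."""
    script = ["(", "[", "x", ")", "]", ")", "<eos>"]
    preferred = script[min(len(prefix), len(script) - 1)]
    scores = {t: 0.0 for t in VOCAB}
    scores[")"] = scores["]"] = 1.0
    scores["<eos>"] = 0.5
    scores[preferred] = 2.0
    return scores


def constrained_decode(
    score_fn: Callable[[List[str]], Dict[str, float]], max_len: int = 20
) -> List[str]:
    """Greedy decoding restricted to tokens the PDA allows.
    Note: if max_len is hit with open brackets, the output is incomplete."""
    pda, out = BracketPDA(), []
    for _ in range(max_len):
        scores = score_fn(out)
        allowed = pda.valid_next()
        # Mask every token the PDA rejects so it can never be selected.
        masked = {t: (s if t in allowed else -math.inf) for t, s in scores.items()}
        token = max(VOCAB, key=lambda t: masked[t])
        if token == "<eos>":
            break
        pda.step(token)
        out.append(token)
    return out


def accepts(tokens: List[str]) -> bool:
    """True if the PDA consumes every token and ends with an empty stack."""
    pda = BracketPDA()
    for t in tokens:
        if t == "<eos>" or t not in pda.valid_next():
            return False
        pda.step(t)
    return not pda.stack


result = constrained_decode(toy_scores)
print(" ".join(result))
```

In this sketch the mask is applied before token selection, so an ungrammatical token can never be chosen; the decoder falls back to the highest-scoring token that the PDA accepts. A PDA for a real language would also track far richer state than a stack of pending closers.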

Paper Structure

This paper contains 38 sections, 10 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples of a parser parsing Python code.
  • Figure 2: Motivation Example.
  • Figure 3: Schematic diagram of a Python grammar PDA parsing Python code.
  • Figure 4: Diagram of CodePAD.
  • Figure 5: An example of state prediction task.
  • ...and 1 more figure