StructCoder: Structure-Aware Transformer for Code Generation

Sindhu Tipirneni, Ming Zhu, Chandan K. Reddy

TL;DR

StructCoder addresses code generation by incorporating code structure into both the encoder and decoder. It introduces a structure-aware Transformer with an AST/DFG-informed encoder and a decoder trained via AST Paths Prediction and Data Flow Prediction, plus a structure-based denoising autoencoder pretraining objective. The model achieves state-of-the-art results on the CodeXGLUE code translation and text-to-code tasks and improves over similarly sized baselines on APPS, with ablations confirming the contribution of the structural components. The work demonstrates the value of leveraging syntax and data flow to improve code generation, while also acknowledging computational costs and deployment considerations.

Abstract

There has been a recent surge of interest in automating software engineering tasks using deep learning. This paper addresses the problem of code generation, where the goal is to generate target code given source code in a different language or a natural language description. Most state-of-the-art deep learning models for code generation use training strategies primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are explicitly trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also support the decoder in preserving the syntax and data flow of the target code by introducing two novel auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark, and improves over baselines of similar size on the APPS code generation benchmark. Our code is publicly available at https://github.com/reddy-lab-code-research/StructCoder/.
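
The root-leaf paths that underlie the AST paths prediction task can be illustrated with a small sketch. It uses Python's built-in `ast` module as a stand-in parser, which is an assumption made for illustration only; the abstract does not specify the parser or the path representation StructCoder uses. The helper names `root_leaf_paths` and `leaf_label` are hypothetical.

```python
import ast


def leaf_label(node):
    """Readable label for a leaf node (illustrative choice, not from the paper)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__


def root_leaf_paths(source):
    """Return (leaf_label, [node types from root to leaf]) for each AST leaf."""
    tree = ast.parse(source)
    paths = []

    def visit(node, prefix):
        path = prefix + [type(node).__name__]
        # Assumption: skip Load/Store context markers, which are bookkeeping
        # nodes in Python's AST rather than tokens in the code.
        children = [
            child
            for child in ast.iter_child_nodes(node)
            if not isinstance(child, ast.expr_context)
        ]
        if not children:
            paths.append((leaf_label(node), path))
        for child in children:
            visit(child, path)

    visit(tree, [])
    return paths


if __name__ == "__main__":
    for label, path in root_leaf_paths("x = a + 1"):
        print(label, "<-", " / ".join(path))
```

Each leaf is paired with the sequence of node types from the root down to it. Under the assumptions above, a sequence of this kind is what a decoder would be asked to predict alongside each generated token.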

Paper Structure

This paper contains 30 sections, 13 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Structure-aware encoder: The input sequence to the encoder consists of source code concatenated with the AST leaves and DFG variables, where the AST leaves are embedded using the root-leaf paths in the AST. The modified structure-aware self-attention mechanism of this Transformer encoder utilizes code-AST/DFG linking information, leaf-leaf similarities in the AST, and the (asymmetric) DFG adjacency matrix to compute the attention matrix.
  • Figure 2: Structure-aware decoder: The decoder generates the next token in the target code and also predicts the node types on the root-leaf path to the leaf containing this token in the target AST, as well as the DFG edges incident on this token.
  • Figure 3: Case study: An example from the Java-C# translation task comparing the outputs of StructCoder and CodeT5. StructCoder makes only one error, assuming that 'cells' is an array of 'Cell' objects instead of a dictionary with values of type 'Cell'. CodeT5, however, misses the first 'if' statement, produces an unbalanced '}', and does not define the variable 'c'. The blue arrows in the StructCoder output show the correctly predicted (probability $> 97^{th}$ percentile) data flow edges incident on variable 'c'.
  • Figure 4: (a) Inference time (in seconds) per sample averaged over 200 samples, and (b) average input length per batch for the 200 samples in the CodeXGLUE translation tasks for model versions including/excluding AST/DFG related components in the encoder. Since the decoder's structure-based components are not active during inference, we did not consider them in this plot.
  • Figure B1: An example from the concode dataset with BLEU=78.85.
  • ...and 2 more figures