Code Prediction by Feeding Trees to Transformers

Seohyun Kim, Jinman Zhao, Yuchi Tian, Satish Chandra

TL;DR

The paper tackles code autocompletion, framed as next-token prediction. It shows that Transformer models outperform prior baselines and that making them aware of AST structure yields further gains. It introduces three architectures (SeqTrans, PathTrans, and TravTrans) and a more structure-aware variant, TravTrans+, for encoding code syntax. Evaluated on the py150 dataset and a Facebook internal Python corpus, TravTrans achieves the highest accuracy, outperforming an RNN-based baseline by 18.3%, Deep3 by 14.1%, and an adaptation of Code2Seq by 14.4%. The work also provides interpretability analyses via saliency, discusses limitations such as out-of-vocabulary (OOV) handling, Python-specificity, and dataset scope, proposes directions for future research, and plans to open-source the code and data preparation pipeline.

Abstract

We advance the state of the art in the accuracy of code prediction (next token prediction) used in autocomplete systems. First, we report that using the recently proposed Transformer architecture, even out of the box, outperforms previous neural and non-neural systems for code prediction. We then show that by making the Transformer architecture aware of the syntactic structure of code, we further increase the margin by which a Transformer-based system outperforms previous systems. With this, it outperforms the accuracy of an RNN-based system (similar to Hellendoorn et al., 2018) by 18.3%, the Deep3 system (Raychev et al., 2016) by 14.1%, and an adaptation of Code2Seq (Alon et al., 2018) for code prediction by 14.4%. We present in the paper several ways of communicating the code structure to the Transformer, which is fundamentally built for processing sequence data. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Facebook internal Python corpus. Our code and data preparation pipeline will be available in open source.
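
The abstract notes that a Transformer is built for sequence data, so tree-shaped code has to be turned into a sequence in some way. As a rough illustration only, the sketch below flattens a Python AST into a token list with a pre-order (parent before children) traversal, using the standard-library `ast` module. The function name `preorder_tokens`, the choice of pre-order traversal, the decision to skip `Load`/`Store` context markers, and the sample line of code are all assumptions made for this example; it is not the authors' data preparation pipeline.

```python
import ast


def preorder_tokens(source):
    """Flatten the AST of `source` into a token list, parents before children."""
    tokens = []

    def visit(node):
        # Emit the node type first, e.g. "Call" or "Name".
        tokens.append(type(node).__name__)
        for _, field in ast.iter_fields(node):
            children = field if isinstance(field, list) else [field]
            for child in children:
                if isinstance(child, ast.expr_context):
                    continue  # assumption: drop Load/Store markers to keep it short
                if isinstance(child, ast.AST):
                    visit(child)  # recurse into a child node
                elif child is not None:
                    tokens.append(str(child))  # leaf value, e.g. an identifier

    visit(ast.parse(source))
    return tokens


# Hypothetical one-line input, chosen only for illustration.
tokens = preorder_tokens("ip = socket.gethostbyname(host)")
print(tokens)
```

In such a sequence, node types and leaf values are interleaved, so a sequence model trained on it sees syntactic context (for example, that `gethostbyname` sits under an `Attribute` inside a `Call`) rather than only the surface tokens.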

Paper Structure

This paper contains 38 sections, 10 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Screenshots of ranked autocomplete predictions from three different models (type-based alphabetical, SeqRNN, and TravTrans), as they would appear in an IDE. A type-based autocomplete tool such as Jedi, which sorts choices alphabetically, ranks "atoi" low. SeqRNN, an RNN-based model, predicts it as the second result. TravTrans, a Transformer-based model, predicts it as the first result. Fewer keystrokes are needed to choose the correct answer as we go from left to right.
  • Figure 2: Running example of Python code. The code snippet is from the py150 dataset.
  • Figure 3: Part of the AST for the example in Fig. 2. The leaf (terminal) nodes have values and the interior (non-terminal) nodes have types.
  • Figure 4: Fragment of a TGEN program encoding a decision tree on the left (bold words are the steps that comprise a path), with the corresponding paths shown on the AST on the right.
  • Figure 5: Example of an input for Code2Seq, which consists of leaf-to-leaf path representations given a partial AST. A path representation is made of tokenized starting tokens, the path, and tokenized ending tokens. If the path ends with the target node (in this example, atoi), the value is replaced by <placeholder>.
  • ...and 3 more figures