Code Prediction by Feeding Trees to Transformers
Seohyun Kim, Jinman Zhao, Yuchi Tian, Satish Chandra
TL;DR
The paper treats code autocomplete as next-token prediction, showing that Transformer models outperform prior baselines and that incorporating AST structure yields additional gains. It introduces three architectures (SeqTrans, PathTrans, and TravTrans) and a more structure-aware variant, TravTrans+, to encode code syntax for prediction. Evaluated on the py150 dataset and a Facebook internal Python corpus, TravTrans achieves the highest accuracy, with reported gains of 18.3% over an RNN-based baseline, 14.1% over Deep3, and 14.4% over an adaptation of Code2Seq. The work also provides interpretability analyses via saliency and discusses limitations such as OOV handling, Python-specificity, and dataset scope, proposing directions for future research and an open-source release of the code and data preparation pipeline.
Abstract
We advance the state-of-the-art in the accuracy of code prediction (next token prediction) used in autocomplete systems. First, we report that using the recently proposed Transformer architecture even out-of-the-box outperforms previous neural and non-neural systems for code prediction. We then show that by making the Transformer architecture aware of the syntactic structure of code, we further increase the margin by which a Transformer-based system outperforms previous systems. With this, it outperforms the accuracy of an RNN-based system (similar to Hellendoorn et al., 2018) by 18.3%, the Deep3 system (Raychev et al., 2016) by 14.1%, and an adaptation of Code2Seq (Alon et al., 2018) for code prediction by 14.4%. We present in the paper several ways of communicating the code structure to the Transformer, which is fundamentally built for processing sequence data. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Facebook internal Python corpus. Our code and data preparation pipeline will be available in open source.
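To make the idea of communicating tree structure to a sequence model concrete, the sketch below linearizes a Python AST with a pre-order depth-first traversal, which is one plausible way to turn a tree into the token sequence a Transformer expects. It is an illustration only, not the paper's pipeline: the function names `preorder_tokens` and `next_token_examples`, the choice to skip expression-context nodes, and the rendering of leaf values are all assumptions made here.

```python
import ast
from typing import List, Tuple


def preorder_tokens(source: str) -> List[str]:
    """Linearize the AST of `source` into tokens via pre-order DFS.

    Hypothetical helper for illustration; not the authors' preprocessing.
    """
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        # Skip Load/Store/Del markers to keep the sequence short (an assumption).
        if isinstance(node, ast.expr_context):
            return
        # Emit the node type first (pre-order), then any leaf value it carries.
        tokens.append(type(node).__name__)
        if isinstance(node, ast.Name):
            tokens.append(node.id)
        elif isinstance(node, ast.Constant):
            tokens.append(repr(node.value))
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def next_token_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Pair each prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    seq = preorder_tokens("x = 1")
    for context, target in next_token_examples(seq):
        print(context, "->", target)
```

Each (prefix, next token) pair corresponds to one prediction a next-token model would be asked to make; a real system would additionally map tokens to vocabulary indices and handle out-of-vocabulary values.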
