Neural Models for Source Code Synthesis and Completion

Mitodru Niyogi

TL;DR

This work treats natural-language-to-code translation and live code completion as a unified neural code synthesis problem, moving beyond rule-based semantic parsers. It compares Seq2Seq, Transformer, and hybrid architectures, introducing Seq2Seq-BART (BERT encoder, GPT decoder) and Seq2Seq-RoBERTa hybrids, augmented by back-translation data and pretraining on the CoNaLa mined corpus. The Seq2Seq-BART variant achieves state-of-the-art BLEU-4 scores on CoNaLa, surpassing TranX by approximately 10.8% and generating more valid compilable code snippets, while the approach also supports Code2NL bidirectionality. Additionally, a RoBERTa-based Python language model (CuRoBERTa-LM) is developed for code completion, demonstrating effective masked-language modeling on code. Overall, the study shows that pretraining, data augmentation, and subword tokenization substantially improve NL2Code and Code2NL performance and that transformer-based architectures with autoregressive decoders yield richer, more diverse code translations suitable for real-time IDE usage.

Abstract

Natural language (NL) to code suggestion systems assist developers in Integrated Development Environments (IDEs) by translating NL utterances into compilable code snippets. Current approaches mainly involve hard-coded, rule-based systems built on semantic parsing. These systems rely heavily on hand-crafted rules that map patterns in NL, or elements of its syntax parse tree, to various query constructs, and they work only on a limited subset of NL with a restricted syntax. They are unable to extract semantic information from the developer's coding intents, and often fail to infer the types, names, and context of the source code needed for accurate system-level code suggestions. In this master's thesis, we present sequence-to-sequence deep learning models and training paradigms that map NL to general-purpose programming languages. These models suggest source code snippets to users given an NL intent, and also offer auto-completion while users are writing source code. The developed architecture incorporates contextual awareness into neural models that generate source code tokens directly, instead of generating parse trees or abstract meaning representations from the source code and converting them back to source code. The proposed pretraining strategy and data augmentation techniques improve the performance of the proposed architecture, which exceeds the performance of a neural semantic parser, TranX, by 10.82% on the BLEU-4 metric. Thereafter, a finer analysis of the parsable code translations generated from NL intents for the CoNaLa challenge is introduced. The proposed system is bidirectional, as it can also be used to generate NL code documentation given source code. Lastly, a RoBERTa masked language model for Python is proposed to extend the developed system to code completion.
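
The abstract mentions an analysis of parsable code translations. As a rough illustration of what such a check might look like, the sketch below counts how many generated snippets parse as Python using the standard-library ast module. This is an assumption about one plausible way to measure parsability, not the thesis's actual evaluation code; the helper names is_parsable and parsable_rate and the example predictions are hypothetical.

```python
import ast


def is_parsable(snippet: str) -> bool:
    """Return True if the snippet is syntactically valid Python."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers inputs such
        # as source containing null bytes on some Python versions.
        return False


def parsable_rate(snippets) -> float:
    """Fraction of snippets that parse; 0.0 for an empty input."""
    snippets = list(snippets)
    if not snippets:
        return 0.0
    return sum(is_parsable(s) for s in snippets) / len(snippets)


# Hypothetical model outputs, invented for illustration only.
predictions = [
    "sorted(d.items(), key=lambda kv: kv[1])",
    "print('hello'",  # unbalanced parenthesis, does not parse
    "x = [i for i in range(10)]",
]
rate = parsable_rate(predictions)
```

Note that parsing checks syntax only: a snippet that parses may still fail at run time (for example, by referencing an undefined name), so a rate like this would be an upper bound on the share of snippets that actually execute.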

Paper Structure

This paper contains 82 sections, 4 equations, 24 figures, and 38 tables.

Figures (24)

  • Figure 1: The Transformer model architecture. Figure drawn from vaswani2017attention.
  • Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consisting of several attention layers running in parallel. Figure drawn from vaswani2017attention.
  • Figure 3: Encoding representation in BERT. Figure drawn from devlin-etal-2019-bert.
  • Figure 4: BART: Inputs to the encoder do not need to be aligned with the decoder outputs, allowing arbitrary noise transformations. The corrupted document is encoded with a bidirectional encoder (left), then the likelihood of the original document is calculated with an auto-regressive decoder (right). Figure drawn from lewis2019bart.
  • Figure 5: Transformations as part of Pre-training objectives for noising the input in BART. Figure drawn from lewis2019bart.
  • ...and 19 more figures