Learning Python Code Suggestion with a Sparse Pointer Network

Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel

TL;DR

This paper addresses code suggestion for dynamic languages like Python by overcoming the fixed-memory bottleneck of standard neural language models. It introduces a Sparse Pointer Network that selectively attends to a memory of previously introduced identifiers; a controller mixes copying from those identifiers with generation from a standard language model (LM). The authors release a large-scale Python corpus of $41$M lines, and experiments show that the sparse pointer mechanism lowers perplexity and raises top-$1$/top-$5$ accuracy relative to an LSTM baseline, with a particularly large gain on identifiers ($13\times$ more accurate prediction). A qualitative analysis shows that the model captures long-range dependencies, such as referring to a class member defined over $60$ tokens in the past, which suggests practical value for IDE code completion in dynamic languages.

Abstract

To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. While current IDEs work well for statically-typed languages, their reliance on type annotations means that they do not provide the same level of support for dynamic programming languages as for statically-typed languages. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long-range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we found standard neural language models to perform well at suggesting local phenomena, but struggle to refer to identifiers that are introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a 5 percentage points increase in accuracy for code suggestion compared to an LSTM baseline. In fact, this increase in code suggestion accuracy is due to a 13 times more accurate prediction of identifiers. Furthermore, a qualitative analysis shows this model indeed captures interesting long-range dependencies, like referring to a class member defined over 60 tokens in the past.
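The abstract reports results as perplexity and code suggestion accuracy. As a rough illustration of what those metrics measure, the sketch below computes them from a matrix of next-token probabilities using the usual textbook definitions. The function names `perplexity` and `top_k_accuracy` are hypothetical, and the paper's exact evaluation protocol (for example, how unknown tokens are handled) may differ.

```python
import numpy as np

def perplexity(probs, targets):
    """Perplexity = exp of the mean negative log-probability of the true tokens.

    probs:   (num_positions, vocab_size) array of predicted distributions
    targets: (num_positions,) array of true token ids
    """
    true_token_probs = probs[np.arange(len(targets)), targets]
    return float(np.exp(-np.mean(np.log(true_token_probs))))

def top_k_accuracy(probs, targets, k):
    """Fraction of positions whose true token is among the k most probable."""
    top_k = np.argsort(-probs, axis=1)[:, :k]         # ids of the k best guesses
    hits = (top_k == targets[:, None]).any(axis=1)    # true token among them?
    return float(hits.mean())

# Toy data (made up for illustration): two positions, three-token vocabulary.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
targets = np.array([0, 2])
print(perplexity(probs, targets))
print(top_k_accuracy(probs, targets, k=1))
print(top_k_accuracy(probs, targets, k=2))
```

Lower perplexity means the model assigns more probability to the tokens that actually occur, while top-$k$ accuracy corresponds more directly to whether the correct token would appear in a short IDE suggestion list.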

Paper Structure

This paper contains 12 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Sparse pointer network for code suggestion on a Python code snippet, showing the next-word distributions of the language model and identifier attention and their weighted combination through $\bm{\lambda}$.
  • Figure 2: Example of the Python code normalization. Original file on the left and normalized version on the right.
  • Figure 3: Code suggestion example involving a reference to a variable (a-d), a long-range dependency (e-h), and the attention weights of the Sparse Pointer Network (i).
  • Figure 4: Full example of code suggestion with a Sparse Pointer Network. Boldface tokens on the left show the first declaration of an identifier. The middle part visualizes the memory of representations of these identifiers. The right part visualizes the output $\bm{\lambda}$ of the controller, which is used for interpolating between the language model (LM) and the attention of the pointer network (Att).
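
The captions for Figures 1 and 4 describe a controller output $\bm{\lambda}$ that interpolates between the language model's next-word distribution and the pointer network's attention over remembered identifiers. The following is a minimal sketch of that interpolation only, not the authors' implementation: the function name `mix_with_pointer`, the two-element form of `lam`, and the toy numbers are assumptions for illustration, and the attention scores and $\bm{\lambda}$ are given as inputs rather than computed from hidden states.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def mix_with_pointer(lm_probs, attn_scores, memory_token_ids, lam):
    """Interpolate an LM distribution with a pointer distribution over identifiers.

    lm_probs:         (vocab_size,) next-token distribution from the language model
    attn_scores:      (K,) unnormalized attention scores over K memory slots
    memory_token_ids: (K,) vocabulary id of the identifier stored in each slot
    lam:              (2,) mixing weights [weight_lm, weight_pointer], summing to 1
    """
    attn = softmax(attn_scores)
    # Scatter attention mass onto the vocabulary; tokens not in memory get zero.
    pointer_probs = np.zeros_like(lm_probs, dtype=float)
    np.add.at(pointer_probs, memory_token_ids, attn)  # repeated ids accumulate
    return lam[0] * lm_probs + lam[1] * pointer_probs

# Toy example: 5-token vocabulary, memory holds identifiers with ids 2, 4, 2.
lm_probs = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
mixed = mix_with_pointer(
    lm_probs,
    attn_scores=np.array([2.0, 0.0, 1.0]),
    memory_token_ids=np.array([2, 4, 2]),
    lam=np.array([0.5, 0.5]),
)
print(mixed, mixed.sum())
```

Because the pointer distribution is nonzero only at identifiers held in memory, the mixture can raise the probability of a rarely seen identifier without disturbing the language model's ranking of the remaining vocabulary.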