Structural Language Models of Code

Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav

TL;DR

The paper introduces structural language modeling (SLM) for any-code completion: a code snippet is modeled as an abstract syntax tree (AST), and the tree's probability is decomposed into a product of conditional probabilities over its nodes. A neural model computes each conditional probability by considering all AST paths leading to the target node, which lets it generate arbitrary code without restrictions on vocabulary or structure. The authors report that it significantly outperforms seq2seq and a variety of structured approaches on Java and C# code. They also release code, data, trained models, and an online demo.

Abstract

We address the problem of any-code completion: generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree: structural language modeling (SLM). SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous techniques that have severely restricted the kinds of expressions that can be generated in this task, our approach can generate arbitrary code in any programming language. Our model significantly outperforms both seq2seq and a variety of structured approaches in generating Java and C# code. Our code, data, and trained models are available at http://github.com/tech-srl/slm-code-generation/. An online demo is available at http://AnyCodeGen.org.
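
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the AST nodes are ordered by a depth-first, left-to-right traversal and scores the tree with the chain rule, log P(tree) = sum over t of log P(a_t | a_<t). The names `Node`, `dfs_order`, `tree_log_prob`, and `uniform_model` are hypothetical. The conditional model here is a uniform placeholder that sees only the flat prefix of earlier labels, whereas the paper's neural model conditions on AST paths leading to the target node.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Node:
    """A toy AST node: a label plus ordered children (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def dfs_order(root: Node) -> List[Node]:
    """Return the nodes in depth-first, left-to-right (pre-order) order."""
    order: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return order


# A conditional model maps (next label, labels generated so far) to a probability.
CondProb = Callable[[str, Sequence[str]], float]


def tree_log_prob(root: Node, cond_prob: CondProb) -> float:
    """Chain rule over nodes: sum of log P(a_t | a_<t) in traversal order."""
    labels = [n.label for n in dfs_order(root)]
    return sum(
        math.log(cond_prob(label, labels[:t]))
        for t, label in enumerate(labels)
    )


def uniform_model(label: str, prefix: Sequence[str]) -> float:
    """Placeholder model (assumption): uniform over a 4-label vocabulary."""
    return 0.25


# Toy AST for the expression `x > 1`.
toy_ast = Node("Greater", [Node("Name:x"), Node("IntLit:1")])
toy_log_prob = tree_log_prob(toy_ast, uniform_model)
```

With the uniform placeholder, every node contributes log 0.25, so the score depends only on the number of nodes. A learned model would replace `uniform_model` with one whose probabilities depend on the partial tree built so far.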

Paper Structure

This paper contains 19 sections, 1 equation, 1 figure, 1 table, 1 algorithm.

Figures (1)

  • Figure 1: Historical locations and number of accepted papers for International Machine Learning Conferences (ICML 1993 -- ICML 2008) and International Workshops on Machine Learning (ML 1988 -- ML 1992). At the time this figure was produced, the number of accepted papers for ICML 2008 was unknown and instead estimated.