Retrieval-Based Neural Code Generation

Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, Graham Neubig

TL;DR

This work tackles natural language to code generation with guarantees of syntactic correctness by generating code as an Abstract Syntax Tree (AST). It introduces ReCode, a retrieval-based method that uses similar NL descriptions to fetch action-subtrees from training data, then biases a neural AST generator through word-aligned substitutions and decoding-time scoring. Key components include a dynamic-programming-based sentence similarity for retrieval, extraction of $n$-gram action subtrees, word-substitution for Copy actions, and retrieval-guided decoding with additive log-probability boosts $\lambda \cdot \text{score}(u)$. Evaluated on Hearthstone and Django datasets, ReCode achieves BLEU improvements up to $+2.6$ and higher exact-match accuracy than strong baselines, demonstrating that retrieval-augmented tree generation can significantly enhance syntactic code generation and potentially generalize to other structured prediction tasks.
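The additive boost $\lambda \cdot \text{score}(u)$ can be illustrated with a small sketch. It is not the authors' implementation: the action strings, the `boost_log_probs` helper, the retrieval scores, and the value of `lam` are all hypothetical, and it assumes that the candidate actions extending a retrieved subtree have already been identified.

```python
import math

def boost_log_probs(log_probs, retrieved_scores, lam=1.0):
    """Return a copy of log_probs where each action u found in
    retrieved_scores has lam * score(u) added to its log-probability.

    log_probs:        dict mapping action -> model log-probability
    retrieved_scores: dict mapping action -> retrieval score(u)
    lam:              weight of the retrieval signal (hypothetical value)
    """
    boosted = dict(log_probs)
    for action, score in retrieved_scores.items():
        if action in boosted:
            boosted[action] += lam * score
    return boosted

# Hypothetical next-action distribution from the neural AST generator.
log_probs = {
    "ApplyRule[Call]": math.log(0.5),
    "ApplyRule[Name]": math.log(0.3),
    "GenToken[x]": math.log(0.2),
}
# Hypothetical: one candidate action would extend a retrieved subtree.
retrieved = {"ApplyRule[Name]": 0.9}

boosted = boost_log_probs(log_probs, retrieved, lam=1.0)
best = max(boosted, key=boosted.get)
print(best)
```

The boosted values are scores rather than normalized log-probabilities, which is sufficient for ranking candidates during beam search. In this toy example the boost moves `ApplyRule[Name]` ahead of the action the model alone would have preferred.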

Abstract

In models to generate program source code from natural language, representing this code in a tree structure has been a common approach. However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures. We introduce ReCode, a method based on subtree retrieval that makes it possible to explicitly reference existing code examples within a neural code generation model. First, we retrieve sentences that are similar to input sentences using a dynamic-programming-based sentence similarity scoring method. Next, we extract n-grams of action sequences that build the associated abstract syntax tree. Finally, we increase the probability of actions that cause the retrieved n-gram action subtree to be in the predicted code. We show that our approach improves the performance on two code generation tasks by up to +2.6 BLEU.
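The retrieval step can be sketched with a standard dynamic-programming similarity. The example below uses word-level Levenshtein distance normalized by the longer sentence as a stand-in; the exact scoring function used by ReCode may differ, and the function names and sample sentences are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b,
    computed by dynamic programming with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def similarity(q1, q2):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    a, b = q1.split(), q2.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

def retrieve(query, corpus, k=1):
    """Return the k training sentences most similar to the query."""
    return sorted(corpus, key=lambda s: similarity(query, s), reverse=True)[:k]

# Hypothetical training descriptions.
corpus = [
    "call the function bar",
    "define a class named Foo",
    "return the length of the list",
]
top = retrieve("call the function foo", corpus, k=1)
print(top, similarity("call the function foo", top[0]))
```

The action sequences paired with the top-ranked sentences would then supply the n-gram action subtrees whose probability is increased during decoding.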

Paper Structure

This paper contains 13 sections, 3 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: The action sequence used to generate the AST for the target code, given the input example. Dashed nodes represent terminals. Each node is labeled with its time step. Each ApplyRule action is shown as a rule in this figure. Blue dotted boxes denote 3-gram action subtrees. Italic words are unedited words. Red bold words are different object names.