Neural Program Synthesis with Priority Queue Training

Daniel A. Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, Quoc V. Le

TL;DR

The paper tackles reward-based automatic program synthesis using an RNN generator trained under two schemes: policy gradient (REINFORCE) and Priority Queue Training (PQT), where a top-$K$ buffer of best programs guides learning. PQT, optionally combined with PG, outperforms a genetic algorithm and vanilla PG on BF language tasks, and the authors show that adding a program length penalty yields shorter, human-readable solutions. The BF benchmark, with its simple syntax and Turing-completeness, serves to demonstrate a stable, scalable approach that can bootstrap exploration from scratch via the top-$K$ buffer. These findings suggest that a compact, off-policy training regime with a small priority queue can effectively drive neural program synthesis, offering a basis for transfer learning in expressive programming environments.

Abstract

We consider the task of program synthesis in the presence of a reward function over the output of programs, where the goal is to find programs with maximal rewards. We employ an iterative optimization scheme, where we train an RNN on a dataset of the K best programs from a priority queue of the programs generated so far. Then, we synthesize new programs by sampling from the RNN and add them to the priority queue. We benchmark our algorithm, called priority queue training (or PQT), against genetic algorithm and reinforcement learning baselines on a simple but expressive Turing-complete programming language called BF. Our experimental results show that our simple PQT algorithm significantly outperforms the baselines. By adding a program length penalty to the reward function, we are able to synthesize short, human-readable programs.
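The iterative scheme in the abstract (sample programs, keep the K best in a priority queue, refit the generator on that queue) can be illustrated with a small sketch. This is not the authors' implementation: the RNN is replaced here by a deliberately simple stand-in (smoothed per-position token counts), and the names `TopK`, `fit`, `sample`, and `pqt`, the fixed program length, and all hyperparameter values are assumptions made for illustration only.

```python
import heapq
import random

TOKENS = "><+-.,[]"  # the eight BF commands


class TopK:
    """Keeps the K highest-reward unique programs seen so far."""

    def __init__(self, k):
        self.k = k
        self.heap = []     # min-heap of (reward, program)
        self.seen = set()  # programs currently held in the heap

    def push(self, reward, program):
        if program in self.seen:
            return
        self.seen.add(program)
        heapq.heappush(self.heap, (reward, program))
        if len(self.heap) > self.k:
            # Evict the lowest-reward entry to stay at K items.
            _, dropped = heapq.heappop(self.heap)
            self.seen.discard(dropped)

    def programs(self):
        return [p for _, p in self.heap]

    def best(self):
        return max(self.heap)


def fit(programs, length, smoothing=1.0):
    """Stand-in for the RNN update (assumption): per-position token weights
    counted from the buffered programs, with additive smoothing so that
    every token keeps a nonzero sampling probability."""
    weights = [[smoothing] * len(TOKENS) for _ in range(length)]
    for prog in programs:
        for i, ch in enumerate(prog):
            weights[i][TOKENS.index(ch)] += 1.0
    return weights


def sample(weights, rng):
    """Draws one fixed-length program, one token per position."""
    return "".join(rng.choices(TOKENS, w)[0] for w in weights)


def pqt(reward_fn, length, k=10, iters=200, batch=32, seed=0):
    """Alternates sampling, queue updates, and refitting on the top K."""
    rng = random.Random(seed)
    queue = TopK(k)
    weights = fit([], length)  # uniform start: exploration from scratch
    for _ in range(iters):
        for _ in range(batch):
            prog = sample(weights, rng)
            queue.push(reward_fn(prog), prog)
        weights = fit(queue.programs(), length)
    return queue.best()
```

A program length penalty of the kind mentioned in the abstract would enter through `reward_fn`, for example by subtracting a small multiple of `len(prog)` from the task reward; with the fixed-length stand-in above that term is constant, so it only matters for a generator that can emit variable-length programs.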

Paper Structure

This paper contains 21 sections, 6 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: An overview of our synthesizer. The synthesizer is an RNN, which generates the program in an autoregressive fashion.
  • Figure 2: A step-through of a BF program that reverses a given list. The target list is loaded into the input buffer, and the program's output is written to the output buffer. Each row depicts the state of the program and memory before executing that step. Purple indicates that some action will be taken when the current step is executed. We skip some steps which are easy to infer. Vertical ellipses indicate the continuation of a loop until its completion.
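
To make the list-reversal example in Figure 2 concrete, the following is a minimal BF interpreter sketch. Several details are assumptions rather than facts from this page: the tape length, 8-bit cell wrap-around, a wrapping memory pointer, reading `0` once the input buffer is exhausted, and the step limit. The program `>,[>,]<[.<]` is a common BF idiom for reversing a zero-terminated list and is used here for illustration; it is not necessarily the exact program shown in the figure.

```python
def run_bf(code, inputs, max_steps=5000, tape_len=100):
    """Interprets a BF program and returns the list of output values."""
    # Precompute matching bracket positions.
    stack, match = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            if not stack:
                raise ValueError("unmatched ]")
            j = stack.pop()
            match[i], match[j] = j, i
    if stack:
        raise ValueError("unmatched [")

    tape = [0] * tape_len
    ptr = pc = steps = 0
    in_pos = 0
    inp = list(inputs)
    out = []
    while pc < len(code) and steps < max_steps:
        c = code[pc]
        if c == ">":
            ptr = (ptr + 1) % tape_len          # assumption: pointer wraps
        elif c == "<":
            ptr = (ptr - 1) % tape_len
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256   # assumption: 8-bit cells
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(tape[ptr])
        elif c == ",":
            # Assumption: reading past the end of the input yields 0.
            tape[ptr] = inp[in_pos] if in_pos < len(inp) else 0
            in_pos += 1
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]                      # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]                      # jump back to loop start
        pc += 1
        steps += 1
    return out


# Reverse a list: the leading ">" leaves cell 0 at zero as a stop marker.
print(run_bf(">,[>,]<[.<]", [1, 2, 3]))  # [3, 2, 1]
```

The first loop `[>,]` copies the input onto the tape until it reads a zero, and the second loop `[.<]` walks back toward cell 0, writing each value to the output buffer along the way.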