Neural Program Synthesis with Priority Queue Training
Daniel A. Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, Quoc V. Le
TL;DR
The paper addresses program synthesis guided by a reward function, using an RNN program generator trained under two schemes: policy gradient (REINFORCE) and Priority Queue Training (PQT), in which the RNN is trained on a top-$K$ buffer of the best programs generated so far. PQT, optionally combined with policy gradient (PG), outperforms a genetic algorithm and vanilla PG on tasks in the BF language, and the authors show that adding a program length penalty to the reward yields shorter, human-readable solutions. BF, with its simple syntax and Turing-completeness, serves as the benchmark for demonstrating a stable, scalable approach that bootstraps exploration from scratch via the top-$K$ buffer. These findings suggest that a compact, off-policy training regime with a small priority queue can effectively drive neural program synthesis, offering a basis for transfer learning in expressive programming environments.
Abstract
We consider the task of program synthesis in the presence of a reward function over the output of programs, where the goal is to find programs with maximal rewards. We employ an iterative optimization scheme, where we train an RNN on a dataset of $K$ best programs from a priority queue of the programs generated so far. Then, we synthesize new programs by sampling from the RNN and add them to the priority queue. We benchmark our algorithm, called priority queue training (or PQT), against genetic algorithm and reinforcement learning baselines on a simple but expressive Turing-complete programming language called BF. Our experimental results show that our simple PQT algorithm significantly outperforms the baselines. By adding a program length penalty to the reward function, we are able to synthesize short, human-readable programs.
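
The iterative scheme described above (sample programs, keep the $K$ best in a priority queue, fit the generator to the queue, repeat) can be illustrated with a minimal sketch. It is not the authors' implementation and makes several loud assumptions: the RNN is replaced by a hypothetical `PositionwiseModel` that keeps an independent token distribution per position, programs have a fixed length, the queue stores unique programs only, and the reward is a toy string-matching score against a hypothetical `TARGET` rather than a reward over program output. The names `TopKBuffer`, `PositionwiseModel`, `pqt`, and `toy_reward` are invented for this illustration.

```python
import heapq
import random

TOKENS = list("><+-.,[]")  # the eight BF commands


class TopKBuffer:
    """Keeps the K highest-reward unique programs seen so far (assumed design)."""

    def __init__(self, k):
        self.k = k
        self.heap = []     # min-heap of (reward, program); heap[0] is the worst kept
        self.seen = set()  # programs currently stored in the heap

    def add(self, reward, program):
        if program in self.seen:
            return  # skip duplicates so the buffer stays diverse
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (reward, program))
            self.seen.add(program)
        elif reward > self.heap[0][0]:
            # Evict the current worst program and insert the better one.
            _, evicted = heapq.heapreplace(self.heap, (reward, program))
            self.seen.discard(evicted)
            self.seen.add(program)

    def programs(self):
        return [p for _, p in self.heap]

    def best(self):
        return max(self.heap)


class PositionwiseModel:
    """Stand-in for the RNN: one categorical distribution per program position."""

    def __init__(self, length):
        uniform = 1.0 / len(TOKENS)
        self.probs = [{t: uniform for t in TOKENS} for _ in range(length)]

    def sample(self, rng):
        return "".join(
            rng.choices(TOKENS, weights=[p[t] for t in TOKENS])[0]
            for p in self.probs
        )

    def fit(self, programs, lr=0.2, smoothing=0.01):
        # Move each distribution toward the smoothed token frequencies in the
        # buffer, i.e. a small step toward maximizing the buffer's likelihood.
        for i, p in enumerate(self.probs):
            for t in TOKENS:
                freq = sum(prog[i] == t for prog in programs) / len(programs)
                target = (1 - smoothing) * freq + smoothing / len(TOKENS)
                p[t] += lr * (target - p[t])


def pqt(reward_fn, length, k=10, batch=64, steps=200, seed=0):
    """Toy priority-queue-training loop: sample, update buffer, fit, repeat."""
    rng = random.Random(seed)
    model = PositionwiseModel(length)
    buffer = TopKBuffer(k)
    history = []  # best reward in the buffer after each step
    for _ in range(steps):
        for _ in range(batch):
            program = model.sample(rng)
            buffer.add(reward_fn(program), program)
        model.fit(buffer.programs())
        history.append(buffer.best()[0])
    return buffer, history


TARGET = "+[>+<-]."  # hypothetical target string, used only by the toy reward


def toy_reward(program):
    # Toy reward: number of positions that match TARGET.
    return sum(a == b for a, b in zip(program, TARGET))


buffer, history = pqt(toy_reward, length=len(TARGET))
print("best (reward, program):", buffer.best())
```

Because the buffer only ever replaces its worst entry with a strictly better program, the best reward it holds never decreases across steps. A length penalty is not modeled here since programs have a fixed length; under variable-length sampling it would be folded into `reward_fn`.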
