ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis
Kensen Shi, Joey Hong, Yinlin Deng, Pengcheng Yin, Manzil Zaheer, Charles Sutton
TL;DR
The paper addresses compositional generalization in neural program synthesis by proposing ExeDec, an execution-space decomposition approach that iteratively predicts execution subgoals and synthesizes a subprogram for each one. It introduces a meta-benchmark, instantiated on RobustFill and DeepCoder, to evaluate generalization beyond i.i.d. data, and shows that ExeDec substantially improves compositional generalization over baselines both with Transformers trained from scratch and with few-shot LLM prompting. Key findings include gains of up to 2×–4× in compositional generalization performance and notable gains in end-to-end task success, though LLMs still struggle with compositionally novel tasks. The work combines planning-like subgoal reasoning with execution-guided synthesis and provides datasets and prompts to support further progress on compositional generalization in neural program synthesis.
Abstract
When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we characterize several different forms of compositional generalization that are desirable in program synthesis, forming a meta-benchmark which we use to create generalization tasks for two popular datasets, RobustFill and DeepCoder. We then propose ExeDec, a novel decomposition-based synthesis strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step. When used with Transformer models trained from scratch, ExeDec has better synthesis performance and greatly improved compositional generalization ability compared to baselines. Finally, we use our benchmarks to demonstrate that LLMs struggle to compositionally generalize when asked to do programming-by-example in a few-shot setting, but an ExeDec-style prompting approach can improve the generalization ability and overall performance.
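The step-by-step strategy described in the abstract (predict an execution subgoal, synthesize a subprogram that reaches it, execute, repeat) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the toy integer DSL, the function names `predict_subgoals`, `synthesize_subprogram`, and `exedec_style_search`, and the greedy distance heuristic that stands in for the learned subgoal model are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Op = Callable[[int], int]

# Hypothetical toy DSL: each subprogram is a single named operation on an int.
DSL: Dict[str, Op] = {
    "add1": lambda x: x + 1,
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}


def predict_subgoals(states: List[int], targets: List[int]) -> List[int]:
    """Toy stand-in for a learned subgoal model (assumption, not the paper's).

    Proposes, for every example, the next intermediate value: the one-step
    result that lies closest to the final targets.
    """
    def distance(values: List[int]) -> int:
        return sum(abs(v - t) for v, t in zip(values, targets))

    candidates = [[op(s) for s in states] for op in DSL.values()]
    return min(candidates, key=distance)


def synthesize_subprogram(states: List[int], subgoals: List[int]) -> Optional[str]:
    """Toy stand-in for a learned synthesizer: find one operation that maps
    the current states to the predicted subgoals on all examples."""
    for name, op in DSL.items():
        if [op(s) for s in states] == subgoals:
            return name
    return None


def exedec_style_search(
    inputs: List[int], outputs: List[int], max_steps: int = 5
) -> Optional[List[str]]:
    """Iterate: predict subgoals, synthesize a subprogram, execute it, and
    continue from the executed result until the outputs are reached."""
    states = list(inputs)
    program: List[str] = []
    for _ in range(max_steps):
        if states == outputs:
            return program
        subgoals = predict_subgoals(states, outputs)
        name = synthesize_subprogram(states, subgoals)
        if name is None:
            return None
        # Execution informs the next step: the new states come from actually
        # running the subprogram, not from the predicted subgoals.
        states = [DSL[name](s) for s in states]
        program.append(name)
    return program if states == outputs else None


if __name__ == "__main__":
    print(exedec_style_search([1, 2, 3], [4, 6, 8]))
```

In this sketch the state after each step is taken from execution rather than from the prediction, so later steps are always conditioned on what the partial program really computes. In the paper's setting both stand-in functions would be neural models; the loop structure is the only part the sketch is meant to convey.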
