ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

Kensen Shi, Joey Hong, Yinlin Deng, Pengcheng Yin, Manzil Zaheer, Charles Sutton

TL;DR

The paper tackles compositional generalization in neural program synthesis. It proposes ExeDec, a decomposition strategy that works in execution space: at each step it predicts an execution subgoal, synthesizes a subprogram to reach that subgoal, and uses the result of executing the subprogram to inform the next step. It also introduces a meta-benchmark, instantiated on RobustFill and DeepCoder, for evaluating generalization beyond i.i.d. data. ExeDec substantially improves compositional generalization over baselines, both with Transformers trained from scratch and with few-shot LLM prompting. Reported gains reach 2×–4× in compositional generalization performance, alongside higher end-to-end task success, though LLMs still struggle with compositionally novel tasks. The work combines planning-like subgoal reasoning with execution-guided synthesis and provides datasets and prompts to support further work on compositional generalization in neural program synthesis.

Abstract

When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we characterize several different forms of compositional generalization that are desirable in program synthesis, forming a meta-benchmark which we use to create generalization tasks for two popular datasets, RobustFill and DeepCoder. We then propose ExeDec, a novel decomposition-based synthesis strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step. When used with Transformer models trained from scratch, ExeDec has better synthesis performance and greatly improved compositional generalization ability compared to baselines. Finally, we use our benchmarks to demonstrate that LLMs struggle to compositionally generalize when asked to do programming-by-example in a few-shot setting, but an ExeDec-style prompting approach can improve the generalization ability and overall performance.
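
The step-by-step strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three-operation `DSL`, the `predict_subgoals` heuristic, and the brute-force `synthesize` function are hypothetical stand-ins for the RobustFill DSL and the learned Transformer models. The sketch also assumes a string-concatenation domain in which each step produces a prefix of the remaining output, so the specification is updated by stripping that prefix.

```python
from itertools import groupby
from typing import Callable, Dict, List, Optional

# Hypothetical toy DSL: each operation maps an input string to one output piece.
DSL: Dict[str, Callable[[str], str]] = {
    "FirstWord": lambda s: (s.split() or [""])[0],
    "LastWord": lambda s: (s.split() or [""])[-1],
    "Const(', ')": lambda s: ", ",
}


def predict_subgoals(remaining: List[str]) -> List[str]:
    """Stand-in for a learned subgoal model.

    Guesses that the next subgoal is the first run of characters sharing the
    same alphanumeric class (a word, or a run of punctuation and spaces).
    """
    subgoals = []
    for text in remaining:
        first_run = next(groupby(text, key=str.isalnum), None)
        subgoals.append("".join(first_run[1]) if first_run else "")
    return subgoals


def synthesize(inputs: List[str], subgoals: List[str]) -> Optional[str]:
    """Stand-in for a learned synthesizer: brute-force search over the toy DSL."""
    for name, op in DSL.items():
        if all(op(x) == goal for x, goal in zip(inputs, subgoals)):
            return name
    return None


def exedec(inputs: List[str], outputs: List[str], max_steps: int = 5) -> Optional[List[str]]:
    """Predict a subgoal, synthesize a subprogram, execute it, update the spec."""
    remaining = list(outputs)
    program: List[str] = []
    for _ in range(max_steps):
        if all(r == "" for r in remaining):
            return program
        subgoals = predict_subgoals(remaining)
        op_name = synthesize(inputs, subgoals)
        if op_name is None:
            return None
        # Execute the subprogram; the update uses what it actually produced,
        # not the predicted subgoal.
        produced = [DSL[op_name](x) for x in inputs]
        if not all(p and r.startswith(p) for r, p in zip(remaining, produced)):
            return None
        remaining = [r[len(p):] for r, p in zip(remaining, produced)]
        program.append(op_name)
    return program if all(r == "" for r in remaining) else None


if __name__ == "__main__":
    print(exedec(["alice smith", "bob jones"], ["smith, alice", "jones, bob"]))
```

In this toy setting each subprogram only has to explain one small piece of the output, which is the intuition behind decomposing in execution space: a model that has seen the individual pieces can, in principle, be guided through a composition it has not seen as a whole.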

Paper Structure

This paper contains 25 sections, 3 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: Our five compositional generalization tasks. Circles represent subprograms that join to form programs as train or test examples, colored circles represent subprograms of a particular concept or operation, and bold outlines represent analogous functionality of different operations.
  • Figure 2: Compositional generalization results with beam size 10. Error bars denote 95% confidence intervals of the mean across 5 trials. On both datasets, ExeDec generalizes better than the no-subgoal ablation, while both decomposition variations greatly outperform the Transformer baseline.
  • Figure 3: The DSL for string transformation tasks in the RobustFill domain, slightly modified from the original RobustFill DSL (Devlin et al., 2017) to add more functionality.
  • Figure 4: The DSL for integer and list manipulation tasks in the DeepCoder domain, originally proposed in the DeepCoder paper (Balog et al., 2017).
  • Figure 5: A comparison of different approaches on the same string manipulation problem in the RobustFill domain, under the Compose-New-Operation generalization task. ExeDec is able to solve the problem correctly with a length 5 program including two usages of the new operation (Compose). The no-subgoal ablation fails to correctly use the Compose operation in step 3, likely because the model has not seen the Compose operation used to produce a prefix of the output. On the other hand, ExeDec succeeds in that step because the relevant prefixes are predicted as subgoals first. The Transformer baseline performs poorly on this task and does not use a single Compose operation.
  • ...and 6 more figures