Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI

Julien Pourcel, Cédric Colas, Pierre-Yves Oudeyer

Abstract

Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in a single attempt. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remains limited by the fixed capabilities of the underlying generative model. We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop. SOAR alternates between (1) an evolutionary search that uses an LLM to sample and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM's sampling and refinement capabilities, enabling increasingly effective search in subsequent iterations. On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the sampling and refinement fine-tuning tasks. These improvements carry over to test-time adaptation, enabling SOAR to solve 52% of the public test set. Our code is open-sourced at: https://github.com/flowersteam/SOAR
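The hindsight learning phase described in the abstract can be illustrated with a minimal sketch. It assumes that a failed candidate program is turned into a valid problem-solution pair by relabeling the task's outputs with whatever the program actually produces. The names `Grid`, `passes`, `hindsight_relabel`, and `flip_horizontal` are hypothetical and chosen for illustration only; they are not taken from the SOAR codebase.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]          # (input grid, output grid)
Program = Callable[[Grid], Grid]


def passes(program: Program, examples: List[Example]) -> bool:
    """True if the program maps every input grid to its expected output grid."""
    return all(program(x) == y for x, y in examples)


def hindsight_relabel(program: Program, examples: List[Example]) -> List[Example]:
    """Build a synthetic task that `program` solves by construction.

    Assumption: the original outputs are discarded and replaced by the
    program's own outputs on the same inputs.
    """
    return [(x, program(x)) for x, _ in examples]


def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate program: mirror each row."""
    return [row[::-1] for row in grid]


# Toy task whose true transformation is a transpose, which the candidate misses.
task = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]

solved_original = passes(flip_horizontal, task)        # candidate fails the real task
relabeled = hindsight_relabel(flip_horizontal, task)   # but defines a task it does solve
solved_relabeled = passes(flip_horizontal, relabeled)
```

Under this assumption, every program that executes without error yields a correct training pair for some task, which is what would let search attempts that miss the original target still serve as fine-tuning data.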

Paper Structure

This paper contains 49 sections, 6 equations, 11 figures, 7 tables, 3 algorithms.

Figures (11)

  • Figure 1: Overview of the SOAR architecture solving a task from the Abstract Reasoning Corpus. Each ARC task implicitly encodes a grid transformation $\hat{f}$, demonstrated via examples $\{x_\text{train},y_\text{train}\}$ such that $\hat{f}(x_\text{train})=y_\text{train}$. To solve a task, one must find the output grids $y_\text{test}$ corresponding to the test input grids $x_\text{test}$. SOAR learns to synthesize transformation programs $f$ in Python by alternating between an evolutionary search phase (using an LLM as the search operator for sampling and refining programs) and a learning phase in which the LLM is fine-tuned with hindsight learning. This improves its efficiency at sampling and refining programs in the evolutionary loop, eventually solving 52% of the ARC-AGI public test set.
  • Figure 2: Iterated self-improvement on training problems. ARC-train performance across training iterations. Training iteration 0: search with the base model. All: score achieved by applying majority voting on the combined generated solutions of the five models.
  • Figure 3: Iterated self-improvement on test problems. ARC-test performance across test-time training iterations. Iteration 0: search with the models finetuned in training iteration 4 (right-most points in Figure 2). All: score achieved by applying majority voting on the combined search data of the five models.
  • Figure 4: Performance plateaus with increasing model size when using fixed sampling and refinement capabilities (SOAR (no learning)). In contrast, SOAR progressively lifts the scaling curves across iterations, enabling smaller models to match or outperform much larger ones. Note that only the 7B, 14B, and 32B models are from the same family (Qwen-2.5-Coder); 72B is from the Qwen-2.5 family, and 123B is Mistral-large-2407. For closed-source LLMs, one-shot results (a single sample) are shown.
  • Figure 5: Search alone hits diminishing returns with increased budget: the base 7B model (SOAR (no learning) with only the search phase of SOAR) plateaus after about 5.2k search attempts. SOAR outperforms this baseline by a wide margin, with improvements compounding across iterations.
  • ...and 6 more figures
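
The majority voting mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal, unweighted version under stated assumptions: each candidate program casts one vote for the output grid it produces on the test input, and programs that raise an error are skipped. The function name `majority_vote` and the toy programs are hypothetical; the voting scheme actually used by SOAR may differ, for example by weighting votes.

```python
from collections import Counter
from typing import Callable, List, Optional

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


def majority_vote(programs: List[Program], test_input: Grid) -> Optional[Grid]:
    """Return the output grid predicted by the most programs, or None if all fail."""
    votes: Counter = Counter()
    for program in programs:
        try:
            output = program(test_input)
        except Exception:
            continue  # a crashing candidate casts no vote
        # Grids are nested lists, so convert to tuples to make them hashable.
        votes[tuple(tuple(row) for row in output)] += 1
    if not votes:
        return None
    winner, _ = votes.most_common(1)[0]
    return [list(row) for row in winner]


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def crashing(grid: Grid) -> Grid:
    raise ValueError("hypothetical faulty candidate")


# Two candidates agree on the flipped grid, one disagrees, one crashes.
candidates = [flip_horizontal, identity, flip_horizontal, crashing]
prediction = majority_vote(candidates, [[1, 2], [3, 4]])
```

Pooling the candidates of several models before voting, as the "All" curves describe, would amount to concatenating their program lists and calling the same function once.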