CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay

Natasha Butt; Blazej Manczak; Auke Wiggers; Corrado Rainone; David W. Zhang; Michaël Defferrard; Taco Cohen

TL;DR

This paper tackles the Abstraction and Reasoning Corpus (ARC), a benchmark where general-purpose large language models struggle due to sparse rewards and reliance on priors. It introduces CodeIt, a scalable self-improvement framework that treats ARC as a programming-by-examples problem and combines an expert-iteration loop with hindsight relabeling and prioritized experience replay to train a CodeT5+-based policy that writes programs in a domain-specific language (DSL). The approach achieves state-of-the-art results on the full ARC evaluation set, solving 59 of 400 tasks (about 15%), and the authors observe that solution programs become shorter over time and can be refined in subsequent iterations. These findings suggest that integrating priors from DSLs and LLMs with experience-driven learning enables efficient inter-task generalization and scalable neuro-symbolic reasoning on complex symbolic tasks.

Abstract

Large language models are increasingly solving tasks that are commonly believed to require human-level reasoning ability. However, these models still perform very poorly on benchmarks of general intelligence such as the Abstraction and Reasoning Corpus (ARC). In this paper, we approach ARC as a programming-by-examples problem, and introduce a novel and scalable method for language model self-improvement called Code Iteration (CodeIt). Our method iterates between 1) program sampling and hindsight relabeling, and 2) learning from prioritized experience replay. By relabeling the goal of an episode (i.e., the target program output given input) to the realized output produced by the sampled program, our method effectively deals with the extreme sparsity of rewards in program synthesis. Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data-augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines. Our code is available at https://github.com/Qualcomm-AI-research/codeit .
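The hindsight relabeling step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Experience` record, the `hindsight_relabel` helper, and the two toy grid operations (`flip_horizontal`, `transpose`) are hypothetical names introduced here for illustration, and programs are modeled as plain Python callables rather than programs in the paper's DSL. The point is only that a sampled program that misses the target still yields a valid training example once the goal is replaced by the output it actually produced.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Assumption: a grid is an immutable 2D tuple of colour indices.
Grid = Tuple[Tuple[int, ...], ...]


@dataclass(frozen=True)
class Experience:
    """Hypothetical replay-buffer entry: program, inputs, realized outputs."""
    program_name: str
    inputs: Tuple[Grid, ...]
    outputs: Tuple[Grid, ...]
    solves_task: bool  # True if realized outputs equal the original targets


def flip_horizontal(grid: Grid) -> Grid:
    """Toy operation: mirror each row left-to-right."""
    return tuple(tuple(reversed(row)) for row in grid)


def transpose(grid: Grid) -> Grid:
    """Toy operation: swap rows and columns."""
    return tuple(zip(*grid))


def hindsight_relabel(
    name: str,
    program: Callable[[Grid], Grid],
    inputs: Sequence[Grid],
    targets: Sequence[Grid],
) -> Experience:
    """Execute the sampled program and store its realized outputs as the goal."""
    realized = tuple(program(grid) for grid in inputs)
    return Experience(name, tuple(inputs), realized, realized == tuple(targets))


# One task whose true solution is a horizontal flip.
inputs = [((1, 2), (3, 4))]
targets = [((2, 1), (4, 3))]

buffer = [
    # A "wrong" sample: still a correct (inputs -> realized outputs) example.
    hindsight_relabel("transpose", transpose, inputs, targets),
    # A sample that happens to solve the original task.
    hindsight_relabel("flip_horizontal", flip_horizontal, inputs, targets),
]
for entry in buffer:
    print(entry.program_name, entry.solves_task)
```

Every entry in `buffer` pairs a program with outputs it genuinely produces, so the policy can be trained on all of them. The `solves_task` flag is one plausible signal a prioritized replay scheme could use when choosing what to sample; the paper's actual priority rule is not stated in this summary.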

Paper Structure (55 sections, 1 equation, 16 figures, 7 tables, 3 algorithms)

Introduction
Method
Design choices
Programming language
Policy
Grid representation
The Code Iteration Algorithm
Initialization
Sampling and hindsight relabeling
Learning
Experiments
Custom baselines
Baselines from literature
Setup
Main results on ARC eval set
...and 40 more sections

Figures (16)

Figure 1: An overview of Code Iteration. In the sampling stage, programs $\rho$ are sampled from the policy $Q_\theta$ conditioned on input-output pairs. The program may not produce target output $O^*$ given $I$, so we use hindsight relabeling: we execute the program, and add the program $\rho$, inputs $I$, and realized outputs $O$ to the buffer. In the learning stage, we train the policy on samples from the buffer.
Figure 2: A simplified ARC task. Given two demonstration input-output pairs, the goal is to determine the output grid for the test example, in three attempts or fewer. The size of the grids and the number of demonstration and test examples differ across tasks.
Figure 3: Sparse grid representation of a simplified ARC task.
Figure 4: Cumulative performance as a function of the number of sampled programs for CodeIt and various baselines, showing the mean and standard deviation over three runs for CodeIt and the custom baselines.
Figure 5: ARC evaluation task 48f8583b and the solution program found by CodeIt.
...and 11 more figures
