ARC-GEN: A Mimetic Procedural Benchmark Generator for the Abstraction and Reasoning Corpus

Michael D. Moffitt

TL;DR

ARC-GEN presents an open-source procedural benchmark generator that exhaustively covers the ARC-AGI-1 task set while mirroring its distributional properties. The approach hinges on per-task parameterization, pixel-perfect generation, and per-task validation to ensure faithful reproduction of original examples and robust testing of solvers. Key contributions include the generate/validate/variation framework, empirical validation against existing generators, and a concrete application to the 2025 Google Code Golf Championship to foster a broad, verifiable challenge corpus. The work advances reproducible, mimetic benchmarking for abstraction and reasoning, with implications for evaluating generalization and guiding future ARC-AGI iterations toward more capable reasoning systems.

Abstract

The Abstraction and Reasoning Corpus remains one of the most compelling and challenging benchmarks for tracking progress toward achieving Artificial General Intelligence. In contrast to other evaluation datasets designed to assess an agent's task-specific skills or accumulated knowledge, the ARC-AGI suite is specifically targeted at measuring skill acquisition efficiency, a trait that has (so far) been lacking in even the most sophisticated machine learning systems. For algorithms that require extensive intra-task exemplars, a significant constraint imposed by ARC-AGI is the modest cardinality of its demonstration set, comprising a small number of $\langle \text{input}, \text{output} \rangle$ grids per task specifying the corresponding transformation. To embellish the space of viable sample pairs, this paper introduces ARC-GEN, an open-source procedural generator aimed at extending the original ARC-AGI training dataset as faithfully as possible. Unlike prior efforts, our generator is both exhaustive (covering all four-hundred tasks) and mimetic (more closely honoring the distributional properties and characteristics embodied in the initial ARC-AGI-1 release). We also discuss the use of this generator in establishing a static benchmark suite to verify the correctness of programs submitted to the 2025 Google Code Golf Championship.
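
To make the generate/validate idea from the TL;DR concrete, here is a minimal sketch in Python. It is not taken from the ARC-GEN codebase. The toy task (mirror the grid left to right), the function names `generate` and `validate`, and all parameters are assumptions chosen for illustration. The sketch shows how a per-task generator might accept optional parameters, and how a validator could fix those parameters to check that a known example is reproduced cell for cell.

```python
import random
from typing import List, Optional, Tuple

Grid = List[List[int]]  # rows of color indices in 0..9


def generate(rng: random.Random,
             width: Optional[int] = None,
             height: Optional[int] = None,
             cells: Optional[List[int]] = None) -> Tuple[Grid, Grid]:
    """Hypothetical toy task: the output is the input mirrored left to right.

    Any parameter left as None is sampled at random; passing all of them
    pins the generator to one specific example.
    """
    width = width if width is not None else rng.randint(3, 10)
    height = height if height is not None else rng.randint(3, 10)
    if cells is None:
        cells = [rng.randint(0, 9) for _ in range(width * height)]
    grid = [cells[r * width:(r + 1) * width] for r in range(height)]
    output = [list(reversed(row)) for row in grid]
    return grid, output


def validate(known_input: Grid, known_output: Grid) -> bool:
    """Check that the generator reproduces a known pair exactly."""
    height, width = len(known_input), len(known_input[0])
    cells = [c for row in known_input for c in row]
    got_in, got_out = generate(random.Random(0), width, height, cells)
    return got_in == known_input and got_out == known_output
```

Under these assumptions, calling `generate(random.Random(seed))` with no parameters yields a fresh random pair, while `validate` acts as a regression check that the parameterization can recreate a reference example without any differing cell.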

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: A natural language description of an ARC-AGI-1 puzzle (ID: 543a7ed5).
  • Figure 2: All examples in the original ARC-AGI-1 benchmark suite for puzzle ID 543a7ed5.
  • Figure 3: Examples for puzzle ID 543a7ed5 produced by the RE-ARC procedural generator.
  • Figure 4: Our procedural generation code for just one of the four-hundred puzzles (ID: 543a7ed5).
  • Figure 5: Examples for puzzle ID 543a7ed5 produced by our ARC-GEN procedural generator.
  • ...and 3 more figures