CellARC: Measuring Intelligence with Cellular Automata
Miroslav Lžičař
TL;DR
CellARC proposes a controllable, synthetic benchmark that isolates the ability to discover local CA rules from compact supervision, enabling rapid, apples-to-apples evaluation across symbolic and neural methods. By grounding tasks in one-dimensional multicolor cellular automata and constraining episodes to ≤256 tokens, the framework decouples generalization from anthropocentric priors and provides explicit difficulty knobs ($k$, $r$, $\lambda$, and $H$) with coverage diagnostics. Empirically, a vanilla encoder-only Transformer with task embeddings delivers the best open-model performance, while a De Bruijn symbolic solver remains competitive under high coverage; a closed LLM (GPT-5 High) excels on extrapolation in low-coverage regimes, underscoring the value of pretraining. The framework exposes neural-symbolic complementarity, capacity scaling effects, and overfitting patterns in recursive models, offering practical benchmarking advantages and a clear path toward extensions to 2D CA, richer automata, and interactive tasks. Overall, CellARC provides a precise, efficient instrument for studying how quickly models infer new rules under tight budgets, informing future designs for rapid adaptation and generalization-driven intelligence.
Abstract
We introduce CellARC, a synthetic benchmark for abstraction and reasoning built from multicolor 1D cellular automata (CA). Each episode has five support pairs and one query serialized in 256 tokens, enabling rapid iteration with small models while exposing a controllable task space with explicit knobs for alphabet size k, radius r, rule family, Langton's lambda, query coverage, and cell entropy. We release 95k training episodes plus two 1k test splits (interpolation/extrapolation) and evaluate symbolic, recurrent, convolutional, transformer, recursive, and LLM baselines. CellARC decouples generalization from anthropomorphic priors, supports unlimited difficulty-controlled sampling, and enables reproducible studies of how quickly models infer new rules under tight budgets. Our strongest small-model baseline (a 10M-parameter vanilla transformer) outperforms recent recursive models (TRM, HRM), reaching 58.0%/32.4% per-token accuracy on the interpolation/extrapolation splits, while a large closed model (GPT-5 High) attains 62.3%/48.1% on subsets of 100 test tasks. An ensemble that chooses per episode between the Transformer and the best symbolic baseline reaches 65.4%/35.5%, highlighting neuro-symbolic complementarity. Leaderboard: https://cellarc.mireklzicar.com
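To make the task space concrete, the following is a minimal sketch of a 1D multicolor CA with alphabet size k and radius r, together with Langton's lambda and a per-token accuracy metric. It is not the benchmark's reference implementation. The function names, the periodic boundary condition, the uniform random rule sampling, and the choice of state 0 as the quiescent state are all assumptions made for illustration.

```python
import itertools
import random


def random_rule(k, r, seed=0):
    """Sample a lookup table mapping each (2r+1)-cell neighborhood to a state.

    Assumption: outputs are drawn uniformly; the benchmark's rule families
    may be sampled differently.
    """
    rng = random.Random(seed)
    width = 2 * r + 1
    return {nb: rng.randrange(k)
            for nb in itertools.product(range(k), repeat=width)}


def step(row, rule, r):
    """Apply the rule once to a row of cells (periodic boundary assumed)."""
    n = len(row)
    return [rule[tuple(row[(i + d) % n] for d in range(-r, r + 1))]
            for i in range(n)]


def langton_lambda(rule, quiescent=0):
    """Fraction of rule-table entries that map to a non-quiescent state."""
    return sum(1 for out in rule.values() if out != quiescent) / len(rule)


def per_token_accuracy(pred, target):
    """Fraction of positions where the prediction matches the target."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have equal length")
    return sum(p == t for p, t in zip(pred, target)) / len(target)


if __name__ == "__main__":
    k, r = 3, 1
    rule = random_rule(k, r, seed=42)
    rng = random.Random(0)
    x = [rng.randrange(k) for _ in range(16)]
    y = step(x, rule, r)  # one (input, output) pair, as in a support example
    print("lambda:", langton_lambda(rule))
    print("input: ", x)
    print("output:", y)
```

Under these assumptions, a solver sees a handful of such (input, output) rows and must predict the output row for a held-out query, which is scored position by position with `per_token_accuracy`.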
