CellARC: Measuring Intelligence with Cellular Automata

Miroslav Lžičař

TL;DR

CellARC proposes a controllable, synthetic benchmark that isolates the ability to discover local CA rules from compact supervision, enabling rapid, apples-to-apples evaluation across symbolic and neural methods. By grounding tasks in one-dimensional multicolor cellular automata and constraining episodes to ≤ 256 tokens, the framework decouples generalization from anthropocentric priors and provides explicit difficulty knobs ($k$, $r$, $\lambda$, and $H$) with coverage diagnostics. Empirically, a vanilla encoder-only Transformer with task embeddings delivers the best open-model performance, while a De Bruijn symbolic solver remains competitive under high coverage; a closed LLM (GPT-5 High) excels on extrapolation in low-coverage regimes, underscoring the value of pretraining. The framework exposes neural-symbolic complementarity, capacity scaling effects, and overfitting patterns in recursive models, offering practical benchmarking advantages and a clear path toward extensions to 2D CA, richer automata, and interactive tasks. Overall, CellARC provides a precise, efficient instrument for studying how quickly models infer new rules under tight budgets, informing future designs for rapid adaptation and generalization-driven intelligence.

Abstract

We introduce CellARC, a synthetic benchmark for abstraction and reasoning built from multicolor 1D cellular automata (CA). Each episode has five support pairs and one query serialized in 256 tokens, enabling rapid iteration with small models while exposing a controllable task space with explicit knobs for alphabet size k, radius r, rule family, Langton's lambda, query coverage, and cell entropy. We release 95k training episodes plus two 1k test splits (interpolation/extrapolation) and evaluate symbolic, recurrent, convolutional, transformer, recursive, and LLM baselines. CellARC decouples generalization from anthropomorphic priors, supports unlimited difficulty-controlled sampling, and enables reproducible studies of how quickly models infer new rules under tight budgets. Our strongest small-model baseline (a 10M-parameter vanilla transformer) outperforms recent recursive models (TRM, HRM), reaching 58.0%/32.4% per-token accuracy on the interpolation/extrapolation splits, while a large closed model (GPT-5 High) attains 62.3%/48.1% on subsets of 100 test tasks. An ensemble that chooses per episode between the Transformer and the best symbolic baseline reaches 65.4%/35.5%, highlighting neuro-symbolic complementarity. Leaderboard: https://cellarc.mireklzicar.com
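To make the knobs named in the abstract concrete, the following is a minimal sketch of a multicolor 1D CA with alphabet size k and radius r, together with Langton's lambda. It is not the benchmark's generator: the function names, the use of state 0 as the quiescent state, the uniformly random rule table, and the periodic boundary condition are all assumptions made for illustration.

```python
import itertools
import random


def random_rule(k, r, seed=0):
    """Hypothetical helper: map every neighborhood of width 2r+1 to a state in [0, k)."""
    rng = random.Random(seed)
    neighborhoods = itertools.product(range(k), repeat=2 * r + 1)
    return {nb: rng.randrange(k) for nb in neighborhoods}


def langton_lambda(rule, quiescent=0):
    """Fraction of rule-table entries whose output is not the quiescent state.

    Treating state 0 as quiescent is an assumption of this sketch.
    """
    return sum(out != quiescent for out in rule.values()) / len(rule)


def step(row, rule, r):
    """One synchronous update; periodic (wrap-around) boundaries are assumed."""
    n = len(row)
    return [
        rule[tuple(row[(i + d) % n] for d in range(-r, r + 1))]
        for i in range(n)
    ]


if __name__ == "__main__":
    rule = random_rule(k=3, r=1, seed=1)
    row = [0, 1, 2, 0, 0, 1, 2, 2]
    print("table size:", len(rule))            # k ** (2r + 1) entries
    print("lambda:", round(langton_lambda(rule), 3))
    print("next row:", step(row, rule, r=1))
```

The table has k^(2r+1) entries, so raising k or r enlarges the space of local rules a model has to infer from the same five support pairs.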

Paper Structure

This paper contains 93 sections, 5 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Example of four multicolor 1D CA rules from CellARC. Top-left: outer-inner-totalistic, $\lambda=0.310$, $H=1.23$; top-right: linear-mod-$k$, $\lambda=0.833$, $H=2.57$; bottom-left: outer-totalistic, $\lambda=0.500$, $H=1.82$; bottom-right: totalistic, $\lambda=0.808$, $H=1.53$.
  • Figure 2: Example CellARC training episode. The episode contains five input-output pairs (sequences of digits, shown as colors) and one query-solution pair. All pairs are obtained by unrolling a cellular automaton and extracting consecutive patches separated by a fixed step size (gap). Extracted patches are demarcated with white cross-hatching. Models are expected to infer the underlying pattern and learn to predict the next step from the regularities observed in the examples. Model input includes I/O pairs and the query I, but not the query S; predictions are scored on S.
  • Figure 3: Pigmentation patterns of natural mollusc shells explainable using cellular automata [kusch1996mollusc].
  • Figure 4: Distribution of CA rule families across training, validation, and test splits. The dataset exhibits diverse family composition (train: Random ~25.5%, Totalistic ~24.1%, Outer Inner Totalistic ~18.9%, Outer Totalistic ~18.9%, Threshold ~12.1%, Linear mod($k$) ~0.6%). Notably, the extrapolation split is concentrated in Totalistic (86.8%) and Linear mod($k$) (9.4%) families due to the highest-$\lambda$/highest-entropy filtering (see text).
  • Figure 5: Distribution of Langton's $\lambda$ regimes across splits. The dataset is dominated by Chaotic ($\lambda > 0.5$; 59.4%) and Edge-of-chaos ($0.4 < \lambda \leq 0.5$; 35.8%) rules, with a small fraction of Ordered ($\lambda \leq 0.4$; 4.8%) dynamics. The extrapolation split now consists entirely of chaotic episodes (100%), removing edge-of-chaos and ordered rules to stress generalization to the highest-entropy CA.
  • ...and 6 more figures
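
The captions of Figures 2 and 5 describe two mechanical steps that a short sketch may help pin down: cutting input-output patches out of an unrolled CA at a fixed gap, and bucketing a rule by its $\lambda$ value. The regime thresholds below are taken from the Figure 5 caption. Everything about the patch extraction (the function name `extract_pairs`, using the same column window for every pair, and pairing row t with row t + gap so that consecutive pairs share a row) is an assumption for illustration, since the captions do not specify the exact layout.

```python
def lambda_regime(lam):
    """Bucket a Langton's lambda value using the thresholds in the Figure 5 caption."""
    if lam <= 0.4:
        return "ordered"
    if lam <= 0.5:
        return "edge-of-chaos"
    return "chaotic"


def extract_pairs(diagram, start_col, width, gap, n_pairs):
    """Hypothetical patch extraction from a space-time diagram (list of rows).

    Pair j takes its input from row j * gap and its output from row
    (j + 1) * gap, both restricted to the same column window. This layout
    is an assumption of the sketch, not the benchmark's documented procedure.
    """
    needed_rows = n_pairs * gap + 1
    if len(diagram) < needed_rows:
        raise ValueError(f"need at least {needed_rows} rows, got {len(diagram)}")
    window = slice(start_col, start_col + width)
    return [
        (diagram[j * gap][window], diagram[(j + 1) * gap][window])
        for j in range(n_pairs)
    ]


if __name__ == "__main__":
    # Toy diagram whose cell values encode (row, column) so the slicing is visible.
    diagram = [[t * 10 + c for c in range(8)] for t in range(13)]
    pairs = extract_pairs(diagram, start_col=2, width=3, gap=2, n_pairs=6)
    support, query = pairs[:5], pairs[5]  # five support pairs plus one query
    print(support[0], query)
    print(lambda_regime(0.310), lambda_regime(0.500), lambda_regime(0.833))
```

With the $\lambda$ values quoted in the Figure 1 caption, 0.310 falls in the ordered bucket, 0.500 in edge-of-chaos, and 0.833 and 0.808 in chaotic.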