CausalARC: Abstract Reasoning with Causal World Models
Jacqueline Maasch, John Kalantari, Kia Khezeli
TL;DR
CausalARC addresses the challenge of evaluating abstract and counterfactual reasoning in language models under limited data and distribution shift by sampling ARC-like tasks from fully specified structural causal models (SCMs). Because each task comes with its generating SCM, the framework supports observational, interventional, and counterfactual queries aligned with the Pearl Causal Hierarchy, and it provides a static dataset and a codebase for task generation. The work demonstrates four language model evaluation settings: abstract reasoning with test-time training, counterfactual reasoning via in-context learning, program synthesis, and causal discovery with logical reasoning. Performance varied substantially across tasks, both within and between models, indicating room for significant improvement in language model reasoning. Overall, CausalARC offers a causally grounded benchmark for studying reasoning under distribution shift, with implications for robust abstract reasoning, causal inference, and automated program synthesis.
Abstract
On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
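To make the three kinds of feedback concrete, the following is a minimal sketch of a toy structural causal model over a flattened 3x3 binary grid. It is not taken from the CausalARC codebase: the function names (`sample_noise`, `run_scm`), the XOR structural equation, and the grid size are all assumptions chosen for illustration. The point is only to show how observational, interventional, and counterfactual samples differ in how they treat the exogenous noise.

```python
import random

def sample_noise(rng, n_cells=9):
    """Draw exogenous noise for one world (hypothetical toy SCM, 3x3 grid flattened)."""
    return {
        "u_x": [rng.randint(0, 1) for _ in range(n_cells)],
        "u_y": [rng.randint(0, 1) for _ in range(n_cells)],
    }

def run_scm(noise, do_x=None):
    """Evaluate the structural equations; `do_x` overrides X (an intervention)."""
    # X := U_X unless an intervention fixes it
    x = list(noise["u_x"]) if do_x is None else list(do_x)
    # Y := X XOR U_Y (assumed structural equation, for illustration only)
    y = [xi ^ ui for xi, ui in zip(x, noise["u_y"])]
    return x, y

rng = random.Random(0)
all_ones = [1] * 9

# Observational: fresh noise, no intervention.
noise = sample_noise(rng)
x_obs, y_obs = run_scm(noise)

# Interventional: fresh noise, X forced to all ones.
x_int, y_int = run_scm(sample_noise(rng), do_x=all_ones)

# Counterfactual: reuse the noise of the observed world, but force X to all ones.
x_cf, y_cf = run_scm(noise, do_x=all_ones)
```

The counterfactual sample shares its noise with the observed world, so `y_cf` answers "what would this particular output grid have been had the input been all ones?". The interventional sample answers the same question only at the population level, because its noise is drawn afresh.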
