CausalARC: Abstract Reasoning with Causal World Models

Jacqueline Maasch, John Kalantari, Kia Khezeli

TL;DR

CausalARC evaluates abstract and counterfactual reasoning in language models (LMs) under limited data and distribution shift by sampling ARC-like tasks from fully specified structural causal models (SCMs). Because every task is generated from an SCM, it supports observational, interventional, and counterfactual queries aligned with the three levels of the Pearl Causal Hierarchy (PCH). The authors provide a static dataset and a codebase for task generation. They illustrate four LM evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning via in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Performance varied heavily across tasks, both within and between models, indicating room for improvement in LM reasoning. CausalARC thus offers a causally grounded testbed for studying reasoning under distribution shift, with relevance to robust abstract reasoning, causal inference, and automated program synthesis.

Abstract

On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.

Paper Structure

This paper contains 24 sections, 2 equations, 27 figures, 4 tables.

Figures (27)

  • Figure 1: The PCH: observing factual realities (L1), exerting actions to induce interventional realities (L2), and imagining alternate counterfactual realities (L3) [bareinboim2022pch].
  • Figure 2: The CausalARC testbed. (A) First, SCM $\mathcal{M}$ is manually transcribed in Python code. (B) Input-output pairs are randomly sampled, providing observational (L1) learning signals about the world model. (C) Sampling from interventional submodels $\mathcal{M}'$ of $\mathcal{M}$ yields interventional (L2) samples $(\mathbf{x}',\mathbf{y}')$. Given pair $(\mathbf{x},\mathbf{y})$, performing multiple interventions while holding the exogenous context constant yields a set of counterfactual (L3) pairs. (D) Using L1 and L3 pairs as in-context demonstrations, we can automatically generate natural language prompts for diverse reasoning tasks.
  • Figure 3: Example input-output pairs from ARC-AGI-1 and ARC-AGI-2.
  • Figure 4: Input-output arrays for ARC-AGI-1 task 31d5ba1a [chollet2024arc]. (A) A subset of the official demonstration pairs $(\mathbf{x}_{train}, \mathbf{y}_{train})$. (B) A random sample from SCM $\mathcal{M}_{\texttt{31d5ba1a}}$ defined in Example \ref{example:scm_xor}. (C) Causal DAG $\mathcal{G}_{\texttt{31d5ba1a}}$ representing $\mathcal{M}_{\texttt{31d5ba1a}}$, where $\mathbf{y}[i,j] = {\color{Magenta}6} \cdot \mathrm{xor} \left( \mathbf{x}[i,j], \; \mathbf{x}[i+3,j] \right)$ for $i \in [0,2], j \in [0,4]$. (D-E) Samples from interventional submodels of $\mathcal{M}_{\texttt{31d5ba1a}}$, where exogenous variables are held constant and the causal effects of interventions propagate to $\mathbf{y}$.
  • Figure 5: Jointly observed counterfactuals in CausalARC. L1, L2, and L3 denote the rungs of the PCH (Figure 1). (A) The distribution over the exogenous context (i.e., the external state). (B) Transformations applied to the exogenous context (e.g., functions $\mathcal{F}$ in the observational world; updated functions $\mathcal{F}_\alpha$ under intervention $\alpha$). (C) Induced distributions, following from the applied transformation. (D) CausalARC samples from each rung of the PCH. Adapted from bareinboim2022pch (Figure 27.2).
  • ...and 22 more figures
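
The sampling scheme described in the captions of Figures 2 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CausalARC codebase: the function names (`sample_exogenous`, `compute_x`, `compute_y`, `sample_pair`), the Bernoulli distribution over the exogenous context, and the input colors 9 and 4 are all hypothetical choices made for this example. Only the output rule $\mathbf{y}[i,j] = 6 \cdot \mathrm{xor}(\mathbf{x}[i,j], \mathbf{x}[i+3,j])$ and the grid shape come from the Figure 4 caption. The sketch treats any nonzero cell as true when applying xor.

```python
import random

ROWS, COLS = 3, 5                # y is 3x5; x stacks two 3x5 halves (6x5)
OUT_COLOR = 6                    # the 6 in the Figure 4(C) formula
TOP_COLOR, BOTTOM_COLOR = 9, 4   # assumed input colors, illustration only


def sample_exogenous(rng, p=0.5):
    """Draw the exogenous context u: one Bernoulli(p) bit per input cell (assumed)."""
    return [[int(rng.random() < p) for _ in range(COLS)]
            for _ in range(2 * ROWS)]


def compute_x(u, do=None):
    """Structural assignment for x; `do` maps (row, col) to a forced value."""
    do = do or {}
    x = []
    for i in range(2 * ROWS):
        color = TOP_COLOR if i < ROWS else BOTTOM_COLOR
        # A hard intervention overrides the cell regardless of u[i][j].
        x.append([do.get((i, j), color * u[i][j]) for j in range(COLS)])
    return x


def compute_y(x):
    """y[i][j] = 6 * xor(x[i][j], x[i+3][j]), with nonzero cells read as True."""
    return [[OUT_COLOR * int(bool(x[i][j]) != bool(x[i + ROWS][j]))
             for j in range(COLS)]
            for i in range(ROWS)]


def sample_pair(u, do=None):
    """Return one input-output pair (x, y) for context u and optional intervention."""
    x = compute_x(u, do)
    return x, compute_y(x)


rng = random.Random(0)

# L1 (observational): sample a context and apply the unmodified mechanisms.
u = sample_exogenous(rng)
x, y = sample_pair(u)

# L2 (interventional): a fresh context passed through an intervened submodel.
u_new = sample_exogenous(rng)
x_do, y_do = sample_pair(u_new, do={(0, 0): 0})

# L3 (counterfactual): several interventions applied to the SAME context u.
interventions = [{(0, 0): 0}, {(0, 0): TOP_COLOR}, {(3, 0): BOTTOM_COLOR}]
cf_pairs = [sample_pair(u, do=d) for d in interventions]
```

Because the three counterfactual pairs share the context `u`, they differ from the factual pair `(x, y)` only where an intervention propagates, here at most in `y[0][0]`. That is the property that distinguishes them from the interventional pair, whose context is redrawn.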

Theorems & Definitions (14)

  • Definition 2.1: Structural causal model (SCM), bareinboim2022pch
  • Definition 2.2: Hard intervention
  • Definition 2.3: Soft intervention
  • Definition 2.4: Counterfactual, pearl2013structural
  • Definition 2.5: Pearl Causal Hierarchy (PCH), bareinboim2022pch
  • Example 3.1: A fully recovered SCM
  • Definition A.1: Generalization, chollet2019measure
  • Definition A.2: Robustness, chollet2019measure
  • Definition A.3: Flexibility, chollet2019measure
  • Definition A.4: Intelligence, chollet2019measure
  • ...and 4 more