Causal Reasoning from Meta-reinforcement Learning

Ishita Dasgupta, Jane Wang, Silvia Chiappa, Jovana Mitrovic, Pedro Ortega, David Raposo, Edward Hughes, Peter Battaglia, Matthew Botvinick, Zeb Kurth-Nelson

TL;DR

This paper investigates whether causal reasoning can emerge from meta-reinforcement learning by training an RNN-based agent with model-free RL on tasks implemented as causal Bayesian networks. The approach enables an inner learning procedure that infers causal structure from data and performs interventions, observations, and counterfactuals in three data settings (observational, interventional, counterfactual). Across these settings, the agents demonstrate do-calculus-like reasoning, resolve unobserved confounders via interventions, and perform counterfactual predictions, all while learning to actively design informative experiments. The results show that end-to-end meta-learning can yield robust causal reasoning in complex, unseen causal graphs, with potential benefits for structured exploration and scalable, real-time reasoning in reinforcement learning contexts.

Abstract

Discovering and exploiting the causal structure in the environment is a crucial challenge for intelligent agents. Here we explore whether causal reasoning can emerge via meta-reinforcement learning. We train a recurrent network with model-free reinforcement learning to solve a range of problems that each contain causal structure. We find that the trained agent can perform causal reasoning in novel situations in order to obtain rewards. The agent can select informative interventions, draw causal inferences from observational data, and make counterfactual predictions. Although established formal causal reasoning algorithms also exist, in this paper we show that such reasoning can arise from model-free reinforcement learning, and suggest that causal reasoning in complex settings may benefit from the more end-to-end learning-based approaches presented here. This work also offers new strategies for structured exploration in reinforcement learning, by providing agents with the ability to perform -- and interpret -- experiments.

Paper Structure

This paper contains 12 sections, 7 figures.

Figures (7)

  • Figure 1: (a): A CBN ${\cal G}$ with a confounder for the effect of exercise ($E$) on health ($H$) given by age ($A$). (b): Intervened CBN ${\cal G}_{\rightarrow E=e}$ resulting from modifying ${\cal G}$ by replacing $p(E|A)$ with a delta distribution $\delta(E-e)$ and leaving the remaining conditional distributions $p(H|E,A)$ and $p(A)$ unaltered.
  • Figure 2: Experiment 1. Agents do cause-effect reasoning from observational data. a) Average reward earned by the agents tested in this experiment. See main text for details. b) Performance split by the presence or absence of at least one parent (Parent and Orphan respectively) on the externally intervened node. c) Quiz phase for a test CBN. Green (red) edges indicate a weight of $+1$ ($-1$). Black represents the intervened node, green (red) nodes indicate a positive (negative) value at that node, white indicates a zero value. The blue circles indicate the agent's choice. Left panel: The undirected version of ${\cal G}$ and the nodes taking the mean values prescribed by $p(X_{1:N\setminus j }|X_j=-5)$, including backward inference to the intervened node's parent. We see that the Optimal Associative Baseline's choice is consistent with maximizing these (incorrect) node values. Right panel: ${\cal G}_{\rightarrow X_j=-5}$ and the nodes taking the mean values prescribed by $p_{\rightarrow X_j=-5}(X_{1:N\setminus j }|X_j=-5)$. We see that the Active-Conditional Agent's choice is consistent with maximizing these (correct) node values.
  • Figure 3: Experiment 2. Agents do cause-effect reasoning from interventional data. a) Average reward earned by the agents tested in this experiment. See main text for details. b) Performance split by the presence or absence of unobserved confounders (abbreviated as Conf. and Unconf. respectively) on the externally intervened node. c) Quiz phase for a test CBN. See Fig. 2 for a legend. Here, the left panel shows the full ${\cal G}$ and the nodes taking the mean values prescribed by $p(X_{1:N\setminus j }|X_j=-5)$. We see that the Active-Conditional Agent's choice is consistent with choosing based on these (incorrect) node values. The right panel shows ${\cal G}_{\rightarrow X_j=-5}$ and the nodes taking the mean values prescribed by $p_{\rightarrow X_j=-5}(X_{1:N\setminus j }|X_j=-5)$. We see that the Active-Interventional Agent's choice is consistent with maximizing these (correct) node values.
  • Figure 4: Active and Random Conditional Agents
  • Figure 5: Active and Random Interventional Agents
  • ...and 2 more figures
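
The distinction drawn in the Figure 1 caption, between conditioning on $E=e$ in ${\cal G}$ and intervening to set $E=e$ in ${\cal G}_{\rightarrow E=e}$, can be illustrated with a minimal sketch. It assumes a linear-Gaussian parameterization of the graph $A \rightarrow E$, $A \rightarrow H$, $E \rightarrow H$ with zero-mean noise. The function names, the edge weights `w_ae`, `w_ah`, `w_eh`, and the unit variances are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def conditional_mean_h(e, w_ae, w_ah, w_eh, var_a=1.0, var_e=1.0):
    """E[H | E=e] in the unmodified graph G (assumed linear-Gaussian)."""
    # Observing E=e is evidence about its parent A (backward inference),
    # so A's mean shifts away from its prior mean of 0.
    mean_a_given_e = w_ae * var_a / (w_ae ** 2 * var_a + var_e) * e
    return w_eh * e + w_ah * mean_a_given_e


def interventional_mean_h(e, w_eh):
    """E[H | do(E=e)] in the intervened graph G_{->E=e}."""
    # p(E|A) is replaced by a delta at e, so E carries no information
    # about A; A keeps its prior mean of 0 and only the E -> H path matters.
    return w_eh * e


def sample_mean_h_under_do(e, w_ah, w_eh, var_a=1.0, var_h=1.0, n=20000, seed=0):
    """Monte Carlo estimate of E[H | do(E=e)] by sampling the intervened graph."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.gauss(0.0, var_a ** 0.5)                    # p(A) is unaltered
        h = rng.gauss(w_eh * e + w_ah * a, var_h ** 0.5)    # p(H|E,A) is unaltered, E fixed to e
        total += h
    return total / n


if __name__ == "__main__":
    e = -5.0
    print("conditional   :", conditional_mean_h(e, w_ae=1.0, w_ah=1.0, w_eh=1.0))
    print("interventional:", interventional_mean_h(e, w_eh=1.0))
    print("sampled do()  :", sample_mean_h_under_do(e, w_ah=1.0, w_eh=1.0))
```

With all three weights set to 1, the conditional mean is pulled further from zero than the interventional mean because the confounder $A$ contributes through backward inference. This is the kind of gap between the "incorrect" and "correct" node values that the quiz-phase panels in Figures 2 and 3 contrast.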