Causally Correct Partial Models for Reinforcement Learning

Danilo J. Rezende, Ivo Danihelka, George Papamakarios, Nan Rosemary Ke, Ray Jiang, Theophane Weber, Karol Gregor, Hamza Merzic, Fabio Viola, Jane Wang, Jovana Mitrovic, Frederic Besse, Ioannis Antonoglou, Lars Buesing

TL;DR

This paper shows that partial models can be causally incorrect: they are confounded by the observations they don't model, and can therefore lead to incorrect planning. It introduces a general family of partial models that are provably causally correct, yet remain fast because they do not need to fully model future observations.

Abstract

In reinforcement learning, we can learn a model of future observations and rewards, and use it to plan the agent's next actions. However, jointly modeling future observations can be computationally expensive or even intractable if the observations are high-dimensional (e.g. images). For this reason, previous works have considered partial models, which model only part of the observation. In this paper, we show that partial models can be causally incorrect: they are confounded by the observations they don't model, and can therefore lead to incorrect planning. To address this, we introduce a general family of partial models that are provably causally correct, yet remain fast because they do not need to fully model future observations.
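
The confounding described above can be illustrated with a small calculation on a FuzzyBear-style problem. The sketch below is not the paper's code. It assumes, for illustration only, rewards of +1 for hugging the teddy bear, -0.5 for hugging the grizzly, and 0 for running away; the names `REWARD`, `P_BEAR`, `observational_value`, and `interventional_value` are hypothetical. It compares what a reward model conditioned only on the action would converge to on data from an expert behavior policy with the true expected reward of forcing that action.

```python
# Minimal sketch of confounding in a FuzzyBear-style MDP.
# All reward values below are illustrative assumptions.
REWARD = {
    ("teddy", "hug"): 1.0,
    ("grizzly", "hug"): -0.5,
    ("teddy", "run"): 0.0,
    ("grizzly", "run"): 0.0,
}
P_BEAR = {"teddy": 0.5, "grizzly": 0.5}  # 50% chance of each bear


def observational_value(action, hug_prob):
    """E[r | a] under data generated by the behavior policy.

    hug_prob[bear] is the probability that the behavior policy hugs
    when it sees that bear. A model that conditions only on the action
    (and not on the observed bear) converges to this quantity.
    """
    numerator, denominator = 0.0, 0.0
    for bear, p_bear in P_BEAR.items():
        p_action = hug_prob[bear] if action == "hug" else 1.0 - hug_prob[bear]
        joint = p_bear * p_action  # p(bear, action) in the training data
        numerator += joint * REWARD[(bear, action)]
        denominator += joint
    if denominator == 0.0:
        return None  # the action never appears in the data
    return numerator / denominator


def interventional_value(action):
    """E[r | do(a)]: the action is forced regardless of which bear appears."""
    return sum(p * REWARD[(bear, action)] for bear, p in P_BEAR.items())


# Expert behavior policy: hug the teddy bear, run from the grizzly.
expert = {"teddy": 1.0, "grizzly": 0.0}

confounded = observational_value("hug", expert)
correct = interventional_value("hug")
print("E[r | a=hug] from behavior data:", confounded)
print("E[r | do(a=hug)]:", correct)
```

Under these assumptions the action-only model sees hugging paired exclusively with the teddy bear, so it overestimates the value of hugging compared with what the agent would receive if it always hugged. The bear observation acts as the confounder: it influences both the behavior policy's action and the reward.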

Paper Structure

This paper contains 37 sections, 28 equations, 14 figures, 6 tables, 2 algorithms.

Figures (14)

  • Figure 1: Examples of stochastic MDPs. (a) FuzzyBear: after visiting a forest, the agent meets either a teddy bear or a grizzly bear with $50\%$ chance and can either hug the bear or run away. (b) AvoidFuzzyBear: here, the agent has the extra option to stay home.
  • Figure 2: Illustration of various causal graphs. (a) Simple dependence without confounding. This is the prevailing assumption in many machine-learning applications. (b) Graph with confounding. (c) Intervention on graph (b) equivalent to setting the value of $x$ and observing $y$. (d) Graph with a backdoor $z$ blocking all paths from $u$ to $x$. (e) Graph with a frontdoor $z$ blocking all paths from $x$ to $y$. (f) Graph with a variable $z$ blocking the direct path from $u$ to $y$.
  • Figure 3: Graphical representations of the environment, the agent, and the various models. Circles are stochastic nodes, rectangles are deterministic nodes. (a) Agent interacting with the environment, generating a trajectory $\{y_t, a_t\}_{t=0}^T$. These trajectories are the training data for the models. (b) Same as (a) but also including the backdoor $z_t$ in the generated trajectory. The red arrows indicate the locations of the interventions. (c) Standard autoregressive generative model of observations. The model predicts the observation $y_t$ which it then feeds into $h_{t+1}$. (d) Example of a Non-Causal Partial Model (NCPM) that predicts the observation $y_t$ without feeding it into $h_{t+1}$. (e) Proposed Causal Partial Model (CPM), with a backdoor $z_t$ for the actions.
  • Figure 4: MDP analysis. In the FuzzyBear environment, we randomly generate 500 policies and scatter-plot them, with the x-axis showing the quality of the behavior policy $V^\pi_{env}$ and the y-axis showing the corresponding model-optimal evaluation $V^*_{M(\pi)}$. For each policy, we derive the corresponding converged model $M(\pi)$, equivalent to training on data generated by the policy, and then compute the optimal evaluation $V^*_{M(\pi)}$ using this model. For good policies $\pi$, we contrast the unrealistic optimism of the non-causal model evaluations $V^*_{\text{NCPM}(\pi)}$ with the more realistic causal model evaluations $V^*_{\text{CPM}(\pi)}$; for bad policies, the non-causal model is over-pessimistic compared to the causal model.
  • Figure 5: (a) MCTS on AvoidFuzzyBear with $p(\textit{Teddy bear}) = 0.55$. The optimal policy should achieve reward 0.6. (b) The non-causal partial model (NCPM) produced visibly worse expectimax search. (c) The causal partial model (CPM) was compatible with MuZero-style MCTS. The search was able to find a much better policy than the pretrained behavior policy.
  • ...and 9 more figures