Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models

Farzane Aminmansour, Taher Jafferjee, Ehsan Imani, Erin Talvitie, Michael Bowling, Martha White

Abstract

Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we highlight that one potential cause of that failure is bootstrapping off of the values of simulated states, and introduce a new Dyna algorithm to avoid this failure. We discuss a design space of Dyna algorithms, based on using successor or predecessor models -- simulating forwards or backwards -- and using one-step or multi-step updates. Three of the variants have been explored, but surprisingly the fourth variant has not: using predecessor models with multi-step updates. We present the Hallucinated Value Hypothesis (HVH): updating the values of real states towards values of simulated states can result in misleading action values which adversely affect the control policy. We discuss and evaluate all four variants of Dyna: three of them update real states toward simulated states -- so potentially toward hallucinated values -- while our proposed approach does not. The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a promising direction toward developing Dyna algorithms that are more robust to model error.
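
To make the fourth variant concrete, the following is a minimal tabular sketch of a multi-step predecessor update. It is an illustration under stated assumptions, not the paper's algorithm: it uses state values rather than action values, a deterministic predecessor model, and no priority queue. The names `multistep_predecessor_update`, `predecessor_model`, `n_steps`, `gamma`, and `alpha` are hypothetical. The point it shows is that every simulated predecessor is updated toward a multi-step return that bootstraps only from the value of the real state, so no real state is updated toward the value of a simulated one.

```python
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable


def multistep_predecessor_update(
    values: Dict[State, float],
    real_state: State,
    predecessor_model: Callable[[State], Tuple[State, float]],
    n_steps: int,
    gamma: float = 0.9,
    alpha: float = 0.5,
) -> List[State]:
    """Roll backwards from a real state and update simulated predecessors.

    Assumption (illustrative only): predecessor_model(s) returns a pair
    (simulated predecessor of s, reward for the step predecessor -> s).
    """
    # The only bootstrap source is the value of the real state,
    # read once before any simulated state is touched.
    multistep_return = values.get(real_state, 0.0)
    state = real_state
    updated: List[State] = []
    for _ in range(n_steps):
        pred, reward = predecessor_model(state)  # one simulated step backwards
        # Return from `pred` along the simulated path to the real state:
        # r_k + gamma * r_{k-1} + ... + gamma^k * V(real_state)
        multistep_return = reward + gamma * multistep_return
        old = values.get(pred, 0.0)
        values[pred] = old + alpha * (multistep_return - old)
        updated.append(pred)
        state = pred
    return updated


# Usage: a chain where the (hypothetical) model says s - 1 precedes s with reward 0.
values = {3: 1.0}
visited = multistep_predecessor_update(
    values, real_state=3, predecessor_model=lambda s: (s - 1, 0.0),
    n_steps=2, gamma=0.9, alpha=1.0,
)
```

In this usage example the values of states 2 and 1 are moved toward discounted copies of the value of state 3, while the value of state 3 itself is left unchanged. If the model's predecessors are wrong, the error stays confined to the simulated states' entries rather than leaking into the real state's value.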

Paper Structure

This paper contains 16 sections, 8 equations, 9 figures, 1 table, 3 algorithms.

Figures (9)

  • Figure 1: A visual comparison of the planning updates in four Dyna algorithms. Circles and black arrows show a trajectory; solid circles are real states and dashed circles are simulated states. A red arrow means that the value of the originating state is updated towards the destination state. All algorithms except Multi-Step Predecessor Dyna allow updates towards simulated states; we show in Section \ref{Section:Hypothesis} that these approaches can suffer from updating towards hallucinated values.
  • Figure 2: (a) The Borderworld environment. (b) An example of an erroneous simulated transition in Borderworld.
  • Figure 3: Situation 1 corresponds to updating on a simulated state that is far from the experienced states in the environment. In this situation, a simulated state $\hat{s}$ is less likely to skew $f(s)$ when it is updated with $(\hat{s}, \hat{y})$. Situation 2 corresponds to updating a (simulated) state within the experienced states $\mathcal{S}$. In this situation, the given target might skew $f(s)$ (Case 2) or not (Case 1).
  • Figure 4: Learning curves on Borderworld over all of the algorithms with the same computational complexity and approximately the same number of updates. Multi-step Dyna variants allow biased one-step updates similar to One-step variants, but would not put that unbiased transition back into the queue for further rollouts. The screening approach in Multi-step variants is also only possible on the on-policy sub-chunks of a trajectory to avoid biased TD errors as much as possible. Error bars are not visible as they are smaller than line thicknesses.
  • Figure 5: Plot of $\max_a Q(s, a) \:\forall s \in \mathcal{S}$ after $100,000$ steps. The red rectangles show where values of real states have been contaminated by values of simulated states.
  • ...and 4 more figures