A New View on Planning in Online Reinforcement Learning

Kevin Roice; Parham Mohammad Panahi; Scott M. Jordan; Adam White; Martha White

TL;DR

The paper introduces Goal-Space Planning (GSP), a subgoal-focused background planning framework that learns local, subgoal-conditioned models to propagate value in an abstract subgoal space, avoiding full dynamics learning. By planning over subgoals and using the resulting values as potential-based shaping, GSP accelerates learning for base learners such as Sarsa($\lambda$) and DDQN across both small and large state spaces, including deep RL settings. Key contributions include a modular GSP formulation, demonstrations of accelerated value propagation in FourRooms, GridBall, and PinBall, and practical insights into stabilizing shaping terms in neural networks. The work addresses sample efficiency and adaptability in online RL, with open questions on subgoal discovery and stability when scaling to deep, high-dimensional environments.
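
As a rough illustration of how subgoal-derived values could act as potential-based shaping, the sketch below uses the standard shaping form $r + \gamma \Phi(s') - \Phi(s)$. The function name `shaped_reward` and the arguments `phi_s` and `phi_next` are hypothetical stand-ins for the value estimates that GSP would project from the subgoal space; the paper's exact update may differ.

```python
def shaped_reward(reward, gamma, phi_s, phi_next, terminal=False):
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    phi_s and phi_next are assumed to be subgoal-derived value estimates
    for the current and next state (hypothetical names, not from the paper).
    The potential of a terminal next state is taken to be zero.
    """
    next_potential = 0.0 if terminal else phi_next
    return reward + gamma * next_potential - phi_s


def discounted_sum(rewards, gamma):
    """Discounted return of a reward sequence, starting at time 0."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


if __name__ == "__main__":
    # Toy trajectory: potentials of visited states and the rewards observed.
    gamma = 0.9
    potentials = [0.2, 0.5, 1.0]   # phi(s_0), phi(s_1), phi(s_2)
    rewards = [0.0, 1.0]           # r_1, r_2
    shaped = [
        shaped_reward(r, gamma, potentials[t], potentials[t + 1])
        for t, r in enumerate(rewards)
    ]
    print(shaped)
```

Along a trajectory the shaping terms telescope, so the shaped return differs from the original return only by $\gamma^T \Phi(s_T) - \Phi(s_0)$. This is why shaping of this form changes how quickly value information reaches a state without changing which policies are optimal.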

Abstract

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.

Paper Structure (13 sections, 6 equations, 21 figures, 6 algorithms)

Introduction
Problem Formulation
Goal Space Planning
Experiments
GSP on Propagating Value
GSP in Larger State Spaces
GSP with Deep Reinforcement Learning
Related Work
Conclusion
Environments
Learning the Option Policies
Learning the Subgoal Models
Pseudocode

Figures (21)

Figure 1: GSP in the PinBall domain. The agent begins with a set of subgoals (denoted in teal) and learns a set of subgoal-conditioned models. (Abstraction) Using these models, the agent forms an abstract MDP where the states are subgoals with options to reach each subgoal as actions. (Planning) The agent plans in this abstract MDP to quickly learn the values of these subgoals. (Projection) Using learned subgoal values, the agent obtains approximate values of states based on nearby subgoals and their values. These quickly updated approximate values are then used to speed up learning.
Figure 2: Goal-Space Planning.
Figure 3: These four plots show the action values after a single episode of updates for Sarsa with and without GSP, and with and without eligibility traces (with traces, $\lambda = 0.9$). Each algorithm's update is simulated from the same data collected from a uniform random policy. Each state (square) is made up of four triangles representing each of the four available actions. White squares represent states not visited in the episode.
Figure 4: This plot shows the average number of steps to goal smoothed over five episodes in the FourRooms domain. Shaded region represents 1 standard error across 100 runs.
Figure 7: Investigating the behavior of GSP in the deep reinforcement learning setting in PinBall. Following the format of the GridBall and PinBall learning-curve figure, we show the 20-episode moving average of steps to the main goal in PinBall.
...and 16 more figures
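
The abstraction, planning, and projection steps named in the Figure 1 caption can be sketched roughly as follows. This is a minimal tabular illustration, not the authors' implementation: it assumes each subgoal-to-subgoal model is summarized by a discounted reward and a discount, and that a state's projected value is the best backup through a nearby subgoal. All names (`plan_over_subgoals`, `project_value`, `r_sub`, `gamma_sub`, `reachable`, `near`) are hypothetical.

```python
import numpy as np


def plan_over_subgoals(r_sub, gamma_sub, reachable, n_sweeps=100):
    """Value iteration on an abstract MDP whose states are subgoals.

    r_sub[i, j]     : assumed discounted reward for going from subgoal i to j
    gamma_sub[i, j] : assumed discount accumulated on that trip
    reachable[i, j] : True if an option from subgoal i can reach subgoal j
    Subgoals with no outgoing option (e.g. the main goal) keep value 0.
    """
    v = np.zeros(r_sub.shape[0])
    for _ in range(n_sweeps):
        backup = np.where(reachable, r_sub + gamma_sub * v[None, :], -np.inf)
        v_new = backup.max(axis=1)
        v_new = np.where(np.isfinite(v_new), v_new, 0.0)
        if np.allclose(v_new, v):
            break
        v = v_new
    return v


def project_value(r_to, gamma_to, near, v_sub):
    """Approximate a state's value from nearby subgoals and their values.

    r_to[g], gamma_to[g] : assumed local model from this state to subgoal g
    near[g]              : True if subgoal g is close enough to be used
    Falls back to 0.0 when no subgoal is nearby (an assumption of this sketch).
    """
    if not near.any():
        return 0.0
    candidates = r_to[near] + gamma_to[near] * v_sub[near]
    return float(candidates.max())


if __name__ == "__main__":
    # Chain of three subgoals 0 -> 1 -> 2, where 2 plays the role of the main goal.
    r_sub = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
    gamma_sub = np.array([[0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]])
    reachable = gamma_sub > 0
    v_sub = plan_over_subgoals(r_sub, gamma_sub, reachable)

    # A state close to subgoals 0 and 1 but not to subgoal 2.
    v_state = project_value(
        r_to=np.zeros(3),
        gamma_to=np.array([0.9, 0.4, 0.0]),
        near=np.array([True, True, False]),
        v_sub=v_sub,
    )
    print(v_sub, v_state)
```

Because the abstract MDP has only as many states as there are subgoals, each planning sweep is cheap, and a change in value at one subgoal reaches distant subgoals in a few sweeps rather than through many one-step updates in the original state space.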
