On shallow planning under partial observability
Randy Lefebvre, Audrey Durand
TL;DR
This work analyzes how the discount factor, which sets the agent's planning horizon, affects the bias-variance trade-off in reinforcement learning under partial observability. It extends structural MDP parameters to horizon-sensitive and model-approximation settings, and develops a bound on planning loss that combines bias and variance components, yielding sharper insights than prior work in certain regimes. A key contribution is the extension of these ideas to POMDPs via history compression, together with a corresponding set of horizon-sensitive parameters. The theoretical results are complemented by numerical experiments on Random MDPs, POMDP abstractions, and a Cartpole deep RL study. The findings suggest that shallow planning can be advantageous in partially observable regimes, which offers practical guidance for choosing discount factors in real-world RL. Open-source code is provided to facilitate further exploration. Overall, the paper links the choice of planning horizon to structural properties of the environment, offering a principled approach to discount-factor selection when data is limited and the state is only partially observed.
Abstract
Formulating a real-world problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting a discount factor for the learning objective (discounted cumulative rewards), which determines the planning horizon of the agent. This work investigates the impact of the discount factor on the bias-variance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.
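
To make the link between the discount factor and the planning horizon concrete, the sketch below uses the standard rule of thumb that a discount factor gamma corresponds to an effective horizon of roughly 1 / (1 - gamma) steps. It is an illustration only and is not taken from the paper's code; the function names `effective_horizon` and `discounted_return` are hypothetical.

```python
from typing import Sequence


def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb planning horizon 1 / (1 - gamma) for 0 <= gamma < 1.

    This is the sum of the geometric weights 1 + gamma + gamma^2 + ...
    (a common convention, assumed here rather than quoted from the paper).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted cumulative reward: sum over t of gamma^t * rewards[t]."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # A smaller gamma means a shallower (shorter) planning horizon.
    for gamma in (0.5, 0.9, 0.99):
        print(f"gamma={gamma}: effective horizon ~ {effective_horizon(gamma):.1f} steps")
```

Under this convention, lowering gamma from 0.99 to 0.9 shrinks the effective horizon from about 100 steps to about 10, which is the sense in which a smaller discount factor corresponds to shallower planning.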
