On shallow planning under partial observability

Randy Lefebvre; Audrey Durand

TL;DR

This work analyzes how the discount factor, which defines the planning horizon, affects the bias-variance trade-off in reinforcement learning under partial observability. It extends structural MDP parameters to horizon-sensitive and model-approximation settings, and develops a bound on the planning loss that combines bias and variance components, offering tighter insights than prior work in certain regimes. A key contribution is the extension of these ideas to POMDPs via history compression and a corresponding set of horizon-sensitive parameters. The theoretical results are complemented by numerical experiments on Random MDPs, POMDP abstractions, and a Cartpole deep RL study. The findings suggest that shallow planning can be advantageous in partially observable regimes, which gives practical guidance for choosing discount factors in real-world RL. The work also provides open-source code to facilitate further exploration. Overall, the paper links planning-horizon choices to structural properties of the environment, offering a principled approach to discount-factor selection in settings with limited data and partial observability.

Abstract

Formulating a real-world problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting a discount factor for the learning objective (discounted cumulative rewards), which articulates the planning horizon of the agent. This work investigates the impact of the discount factor on the bias-variance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.
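To make the trade-off described in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes a small random tabular MDP, a model estimated from a few samples per state-action pair, and certainty-equivalent planning with value iteration. All function names (`effective_horizon`, `random_mdp`, `estimate_model`, `planning_loss`) and all sizes are hypothetical choices made for illustration. The sketch plans in the estimated model with a shallow discount factor `gamma_plan` and measures the loss of the resulting policy in the true model under the evaluation discount factor `gamma_eval`.

```python
import numpy as np

def effective_horizon(gamma):
    """Rule of thumb: rewards beyond roughly 1 / (1 - gamma) steps weigh little."""
    return 1.0 / (1.0 - gamma)

def random_mdp(rng, n_states, n_actions):
    """Hypothetical generator: Dirichlet transitions, uniform rewards in [0, 1]."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (S, A, S)
    R = rng.uniform(size=(n_states, n_actions))                        # (S, A)
    return P, R

def estimate_model(rng, P, n_samples):
    """Empirical transition model from n_samples draws per state-action pair."""
    S, A, _ = P.shape
    return np.array([[rng.multinomial(n_samples, P[s, a]) / n_samples
                      for a in range(A)] for s in range(S)])

def value_iteration(P, R, gamma, iters=5000, tol=1e-10):
    """Return a greedy policy for transitions P (S, A, S) and rewards R (S, A)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V_new = (R + gamma * P @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (R + gamma * P @ V).argmax(axis=1)

def policy_value(P, R, policy, gamma):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) V = R_pi."""
    idx = np.arange(P.shape[0])
    P_pi, R_pi = P[idx, policy], R[idx, policy]
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, R_pi)

def planning_loss(P, R, P_hat, gamma_plan, gamma_eval):
    """Max-norm gap, in the true MDP under gamma_eval, between the optimal
    policy and the policy planned in the estimated model with gamma_plan."""
    pi_star = value_iteration(P, R, gamma_eval)
    pi_hat = value_iteration(P_hat, R, gamma_plan)
    gap = policy_value(P, R, pi_star, gamma_eval) - policy_value(P, R, pi_hat, gamma_eval)
    return float(np.max(np.abs(gap)))

rng = np.random.default_rng(0)
P, R = random_mdp(rng, n_states=10, n_actions=2)
P_hat = estimate_model(rng, P, n_samples=5)
for gamma_plan in (0.3, 0.6, 0.9, 0.99):
    loss = planning_loss(P, R, P_hat, gamma_plan, gamma_eval=0.99)
    print(f"gamma_plan={gamma_plan:.2f}  "
          f"horizon~{effective_horizon(gamma_plan):.0f}  loss={loss:.3f}")
```

Which `gamma_plan` gives the smallest loss depends on the sampled MDP and on how many samples feed the estimated model, so a single run is only an illustration. Averaging over many seeds and sample sizes is what would reveal a trend.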

Paper Structure

This paper contains 26 sections, 6 theorems, 39 equations, 4 figures, and 1 table.

Introduction
Contributions
Fully Observable Setting
Blackwell Discount Factor
Planning Loss
Improving the Bias Bound
Controlling the Variance
A New Bound on the Planning Loss
Bias Under Partial Observability
Extending Structural Parameters
Numerical Experiments
Random MDPs
Extension to Partial Observability
Impact of Partial Observability on Deep RL
Related Work
...and 11 more sections

Key Result

Proposition 1

Given an MDP $M$, let $P_{s, k}^\pi$ denote the vector of transition probabilities from state $s\in\mathcal{S}$ to every possible state when following policy $\pi$ for $k\geq 1$ time steps. The transition probabilities when following the policy that is optimal for a shallow planning horizon ...
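The vector $P_{s, k}^\pi$ defined above can be computed directly for a tabular MDP. The sketch below is an illustration under stated assumptions, not the paper's implementation: the policy is deterministic, the transition tensor has shape (states, actions, states), and the function names are hypothetical. The L1 distance used to compare two policies is also an assumption, since the statement of the proposition is cut off in this excerpt.

```python
import numpy as np

def k_step_transition_vector(P, policy, s, k):
    """Distribution over states after following `policy` for k >= 1 steps from s.

    P has shape (S, A, S); policy is an integer array of shape (S,).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]   # (S, S) transition matrix under policy
    dist = np.zeros(n_states)
    dist[s] = 1.0                           # start deterministically in state s
    for _ in range(k):
        dist = dist @ P_pi                  # advance the distribution one step
    return dist

def l1_distance(P, policy_a, policy_b, s, k):
    """L1 distance between the k-step vectors of two policies (assumed norm)."""
    diff = (k_step_transition_vector(P, policy_a, s, k)
            - k_step_transition_vector(P, policy_b, s, k))
    return float(np.abs(diff).sum())

# Toy two-state MDP: action 0 stays in place, action 1 moves to the other state.
P = np.zeros((2, 2, 2))
P[:, 0] = np.eye(2)
P[:, 1] = np.eye(2)[::-1]
stay = np.array([0, 0])
swap = np.array([1, 1])
print(k_step_transition_vector(P, swap, s=0, k=1))
print(l1_distance(P, stay, swap, s=0, k=1))
```

In this toy example the two policies end up in different states after one step and in the same state after two steps, so the distance between their vectors depends on $k$.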

Figures (4)

Figure 1: Proportion of randomly sampled MDPs where Eq. \ref{eq:condition} holds given a discount factor $\gamma$.
Figure 2: Left: Distribution of Blackwell discount factors over $10^4$ POMDPs given the number of observations. Right: Average normalized bias given the discount factor and number of observations.
Figure 3: Left: Distribution of normalized $\kappa_{M,\gamma}^\phi$ over $10^4$ POMDPs given the number of observations and the discount factor used. Right: Distribution of normalized $\delta_M^\phi$ given the number of observations.
Figure 4: Average reward and standard deviation obtained by running 10 models on 100 environment seeds given the noise level and discount factor.

Theorems & Definitions (12)

Definition 1: Value-function variation (Jiang et al., 2016)
Definition 2: Action variation (Jiang et al., 2016)
Definition 3: Discordant state-action pairs
Definition 4: Horizon-sensitive action variation
Proposition 1: Horizon-sensitive transition probabilities distance
Definition 5: Variance due to model approximation
Definition 6: Empirical action variation
Proposition 2: Empirical transition probabilities distance
Lemma 1: Variance
Theorem 1: Planning loss
...and 2 more
