On shallow planning under partial observability

Randy Lefebvre, Audrey Durand

TL;DR

This work analyzes how the discount factor, which sets the planning horizon, affects the bias-variance trade-off in reinforcement learning under partial observability. It extends structural MDP parameters to horizon-sensitive and model-approximation contexts, and develops a bound on planning loss that combines bias and variance components, providing tighter insights than prior work in certain regimes. A key contribution is extending these ideas to POMDPs via history compression and a corresponding set of horizon-sensitive parameters, with theoretical results complemented by numerical experiments on Random MDPs, POMDP abstractions, and a Cartpole deep RL study. The findings suggest that shallow planning can be advantageous in partially observable regimes, offering practical guidance for choosing discount factors in real-world RL, and the work provides open-source code to facilitate further exploration. Overall, the paper links planning horizon choices to structural properties of the environment, offering a principled approach to discount-factor selection in settings with limited data and partial observability.

Abstract

Formulating a real-world problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting a discount factor for the learning objective (discounted cumulative rewards), which articulates the planning horizon of the agent. This work investigates the impact of the discount factor on the bias-variance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.
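
As a minimal illustration of how the discount factor shapes the planning horizon, the sketch below computes a discounted cumulative reward and the commonly used effective-horizon heuristic $1/(1-\gamma)$. The function names `discounted_return` and `effective_horizon` are hypothetical and are not taken from the paper's code; the heuristic is a standard rule of thumb, not a quantity defined in this summary.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step applies one extra factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma); an assumption, not the paper's definition."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    rewards = [1.0] * 50  # hypothetical constant-reward trajectory
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, effective_horizon(gamma), discounted_return(rewards, gamma))
```

A smaller $\gamma$ shrinks the weight on distant rewards geometrically, so the agent effectively plans over fewer steps; this is the sense in which a lower discount factor corresponds to shallow planning.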

Paper Structure (26 sections, 6 theorems, 39 equations, 4 figures, 1 table)

Key Result

Proposition 1

Given an MDP $M$, let $P_{s, k}^\pi$ denote the vector of the transition probabilities from state $s\in\mathcal{S}$ to every possible state when following policy $\pi$ for $k\geq 1$ time steps. The transition probabilities when following the policy that is optimal for a shallow planning horizon ($\

Figures (4)

  • Figure 1: Proportion of randomly sampled MDPs where Eq. \ref{eq:condition} is true given a discount factor $\gamma$.
  • Figure 2: Left: Distribution of Blackwell discount factors over $10^4$ POMDPs given the number of observations. Right: Average normalized bias given the discount factor and number of observations.
  • Figure 3: Left: Distribution of normalized $\kappa_{M,\gamma}^\phi$ over $10^4$ POMDPs given the number of observations and the discount factor used. Right: Distribution of normalized $\delta_M^\phi$ given the number of observations.
  • Figure 4: Average reward and standard deviation obtained by running 10 models on 100 environment seeds given the noise level and discount factor.

Theorems & Definitions (12)

  • Definition 1: Value-function variation [jiang2016structural]
  • Definition 2: Action variation [jiang2016structural]
  • Definition 3: Discordant state-action pairs
  • Definition 4: Horizon-sensitive action variation
  • Proposition 1: Horizon-sensitive transition probabilities distance
  • Definition 5: Variance due to model approximation
  • Definition 6: Empirical action variation
  • Proposition 2: Empirical transition probabilities distance
  • Lemma 1: Variance
  • Theorem 1: Planning loss
  • ...and 2 more