Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

Miguel Suau, Matthijs T. J. Spaan, Frans A. Oliehoek

TL;DR

Policy confounding describes how RL policies can induce spurious correlations by shaping observed variables, producing habits that fail when trajectories deviate. The authors develop a formal state-representation framework with Markov, minimal, and $\pi$-Markov notions and illustrate the phenomenon with the Frozen T-Maze and other grid-worlds. They show that function approximation and narrow trajectory distributions exacerbate out-of-trajectory generalization failures, especially for on-policy methods, and propose practical mitigations based on off-policy data, exploration, and domain randomization. The work highlights the need for causally aware representations and provides a foundation for more robust generalization in reinforcement learning.

Abstract

Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs.
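
The mechanism described above can be illustrated with a minimal sketch. It is a hypothetical two-corridor toy, only loosely inspired by the Frozen T-Maze, and is not the authors' environment or code. The names `rollout`, `habit_accuracy`, `converged_policy`, `random_policy`, and the `slip_prob` parameter are all assumptions made for illustration. A start signal determines the rewarded goal, and the "habit" is to guess the goal from the corridor the agent is in rather than from the signal itself.

```python
import random

def rollout(policy, slip_prob, rng):
    """One episode of a hypothetical two-corridor toy (an assumption, not the paper's setup)."""
    signal = rng.randint(0, 1)        # start signal; it alone determines the rewarded goal
    corridor = policy(signal, rng)    # corridor chosen by the policy
    if rng.random() < slip_prob:      # an environment change pushes the agent off its usual path
        corridor = 1 - corridor
    return signal, corridor

def habit_accuracy(policy, slip_prob, episodes=10_000, seed=0):
    """Accuracy of the habit 'goal == current corridor', which ignores the signal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(episodes):
        signal, corridor = rollout(policy, slip_prob, rng)
        hits += int(corridor == signal)
    return hits / episodes

def converged_policy(signal, rng):
    # Narrow trajectory distribution: the corridor always matches the signal.
    return signal

def random_policy(signal, rng):
    # Broad trajectory distribution: the corridor is independent of the signal.
    return rng.randint(0, 1)

if __name__ == "__main__":
    print("converged, no slip:", habit_accuracy(converged_policy, 0.0))
    print("random, no slip:   ", habit_accuracy(random_policy, 0.0))
    print("converged, slip=.3:", habit_accuracy(converged_policy, 0.3))
```

Under the converged policy the corridor predicts the goal perfectly, so the habit looks sound on the agent's own trajectories. Under random actions, or once slips force deviations, the same habit degrades, because the corridor was never what determined the goal.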

Paper Structure

This paper contains 37 sections, 7 theorems, 15 equations, 15 figures, and 2 tables.

Key Result

Proposition 1

Let $\mathbf{\Phi^*}$ be the set of all possible minimal state representations, where every $\Phi^* \in \mathbf{\Phi^*}$ is defined as $\Phi^*: \mathcal{S} \to \bar{\mathcal{S}}^*$ with $\bar{\mathcal{S}}^* = \times \bar{\mathcal{F}}^*$. For all $\pi$ and all $\Phi^* \in \mathbf{\Phi^*}$, there exis

Figures (15)

  • Figure 1: Left: An illustration of the Frozen T-Maze environment. Right: Learning curves when evaluated in the Frozen T-Maze environment with (blue curve) and without (red curve) ice.
  • Figure 2: Two DBNs representing the dynamics of the Frozen T-Maze environment, when actions are sampled at random (left), and when they are determined by the optimal policy (right). The green circles highlight the $\pi$-minimal state representation in each of the two cases.
  • Figure 3: A DBN illustrating the phenomenon of policy confounding. The policy opens a backdoor path that can affect conditional relations between the variables in $F_t$ and $F_{t+1}$.
  • Figure 4: Illustrations of the Key2Door (left) and Diversion (right) environments.
  • Figure 5: DQN vs. PPO in the train and evaluation variants of Frozen T-Maze (left), Key2Door (middle), and Diversion (right).
  • ...and 10 more figures

Theorems & Definitions (24)

  • Definition 1: MDP
  • Definition 2: FMDP
  • Example 1
  • Definition 3: State representation
  • Definition 4: Markov state representation
  • Definition 5: Minimal state representation
  • Definition 6: Superfluous variable
  • Definition 7: $\pi$-Markov state representation
  • Definition 8: $\pi$-minimal state representation
  • Proposition 1
  • ...and 14 more