Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

Miguel Suau; Matthijs T. J. Spaan; Frans A. Oliehoek

TL;DR

Policy confounding describes how RL policies can induce spurious correlations by shaping observed variables, producing habits that fail when trajectories deviate. The authors develop a formal state-representation framework with Markov, minimal, and $\pi$-Markov notions and illustrate the phenomenon with the Frozen T-Maze and other grid-worlds. They show that function approximation and narrow trajectory distributions exacerbate out-of-trajectory generalization failures, especially for on-policy methods, and propose practical mitigations based on off-policy data, exploration, and domain randomization. The work highlights the need for causal-aware representations and provides a foundation for more robust generalization in reinforcement learning.

Abstract

Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs.
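
The mechanism described in the abstract can be illustrated with a minimal sketch. The two-corridor task below is a hypothetical toy, not the paper's Frozen T-Maze, and every name in it (`rollout`, `converged_policy`, `habit_success`, `slip_prob`) is an assumption made for illustration. A signal decides which goal is rewarding, and the policy decides which corridor the agent enters. Under a random policy the corridor says nothing about the signal. Under a converged policy the two always agree, so an agent could learn to read the goal from its corridor alone, and that habit breaks once the environment pushes it off its usual trajectory.

```python
import random


def rollout(policy, rng):
    """One episode of the hypothetical task: returns (signal, corridor)."""
    signal = rng.choice([0, 1])      # which goal is rewarding this episode
    corridor = policy(signal, rng)   # corridor the agent ends up in
    return signal, corridor


def random_policy(signal, rng):
    """Exploration phase: the corridor is chosen independently of the signal."""
    return rng.choice([0, 1])


def converged_policy(signal, rng):
    """Converged policy: always take the corridor that matches the signal."""
    return signal


def agreement(policy, episodes=10_000, seed=0):
    """Fraction of episodes in which corridor and signal coincide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(episodes):
        signal, corridor = rollout(policy, rng)
        hits += signal == corridor
    return hits / episodes


def habit_success(slip_prob, episodes=10_000, seed=0):
    """Success rate of an agent that infers the goal from its corridor only.

    With probability slip_prob the environment moves the agent to the other
    corridor, a stand-in for an off-trajectory perturbation (assumption).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(episodes):
        signal, corridor = rollout(converged_policy, rng)
        if rng.random() < slip_prob:
            corridor = 1 - corridor   # forced deviation from the usual path
        guessed_goal = corridor       # the habit: ignore the signal entirely
        wins += guessed_goal == signal
    return wins / episodes


if __name__ == "__main__":
    print("agreement, random policy:   ", agreement(random_policy))
    print("agreement, converged policy:", agreement(converged_policy))
    print("habit success, no slips:    ", habit_success(0.0))
    print("habit success, 50% slips:   ", habit_success(0.5))
```

In this toy, the corridor is a perfect predictor of the goal only on the trajectories the converged policy generates. The habit succeeds in every episode without slips and degrades toward chance as the slip probability grows.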

Paper Structure (37 sections, 7 theorems, 15 equations, 15 figures, 2 tables)

Introduction
Contributions
Example: Frozen T-Maze
Related Work
Preliminaries
Notation
Problem formulation
State representations
Markov state representations
$\pi$-Markov state representations
Policy Confounding
Why should we care about policy confounding?
When should we worry about OOT generalization in practice?
Function approximation
Narrow trajectory distributions
...and 22 more sections

Key Result

Proposition 1

Let $\mathbf{\Phi^*}$ be the set of all possible minimal state representations, where every $\Phi^* \in \mathbf{\Phi^*}$ is defined as $\Phi^*: \mathcal{S} \to \bar{\mathcal{S}}^*$ with $\bar{\mathcal{S}}^* = \times \bar{\mathcal{F}}^*$. For all $\pi$ and all $\Phi^* \in \mathbf{\Phi^*}$, there exis

Figures (15)

Figure 1: Left: An illustration of the Frozen T-Maze environment. Right: Learning curves when evaluated in the Frozen T-Maze environment with (blue curve) and without (red curve) ice.
Figure 2: Two DBNs representing the dynamics of the Frozen T-Maze environment, when actions are sampled at random (left), and when they are determined by the optimal policy (right). The green circles highlight the $\pi$-minimal state representation in each of the two cases.
Figure 3: A DBN illustrating the phenomenon of policy confounding. The policy opens a backdoor path that can affect conditional relations between the variables in $F_t$ and $F_{t+1}$.
Figure 4: Illustrations of the Key2Door (left) and Diversion (right) environments.
Figure 5: DQN vs. PPO in the train and evaluation variants of Frozen T-Maze (left), Key2Door (middle), and Diversion (right).
...and 10 more figures

Theorems & Definitions (24)

Definition 1: MDP
Definition 2: FMDP
Example 1
Definition 3: State representation
Definition 4: Markov state representation
Definition 5: Minimal state representation
Definition 6: Superfluous variable
Definition 7: $\pi$-Markov state representation
Definition 8: $\pi$-minimal state representation
Proposition 1
...and 14 more
