Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL
Miguel Suau, Matthijs T. J. Spaan, Frans A. Oliehoek
TL;DR
Policy confounding describes how RL policies can induce spurious correlations by shaping observed variables, producing habits that fail when trajectories deviate. The authors develop a formal state-representation framework with Markov, minimal, and π-Markov notions and illustrate the phenomenon with the Frozen T-Maze and other grid-worlds. They show that function approximation and narrow trajectory distributions exacerbate out-of-trajectory generalization failures, especially for on-policy methods, and propose practical mitigations based on off-policy data, exploration, and domain randomization. The work highlights the need for causal-aware representations and provides a foundation for more robust generalization in reinforcement learning.
Abstract
Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs.
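The mechanism described above can be made concrete with a toy simulation. The sketch below is not the paper's Frozen T-Maze or any of its experiments; it is a hypothetical two-variable setting in which a binary `signal` determines the rewarding goal and a binary `path` is chosen by the policy. All names (`collect`, `goal_given_path`, `habitual`, `exploratory`) are assumptions made for illustration. Under a converged policy that always picks the path matching the signal, the path becomes a perfect predictor of the goal, even though the path has no causal effect on it. Under an exploratory policy the correlation disappears.

```python
import random

def collect(policy, episodes=10_000, seed=0):
    """Roll out a toy environment and record (signal, path) pairs."""
    rng = random.Random(seed)
    data = []
    for _ in range(episodes):
        signal = rng.randint(0, 1)   # the signal determines the rewarding goal
        path = policy(signal, rng)   # the path is chosen by the policy
        data.append((signal, path))
    return data

def goal_given_path(data):
    """Empirical P(goal = 1 | path) for each path value (goal == signal here)."""
    estimates = {}
    for p in (0, 1):
        goals = [signal for signal, path in data if path == p]
        estimates[p] = sum(goals) / len(goals) if goals else float("nan")
    return estimates

# Hypothetical converged policy: the path always matches the signal.
habitual = lambda signal, rng: signal
# Hypothetical exploratory policy: the path ignores the signal.
exploratory = lambda signal, rng: rng.randint(0, 1)

on_policy = goal_given_path(collect(habitual))
off_policy = goal_given_path(collect(exploratory))

print("P(goal=1 | path) under the habitual policy:   ", on_policy)
print("P(goal=1 | path) under the exploratory policy:", off_policy)
```

An agent trained only on the habitual data could adopt the shortcut "the goal equals the path" and never be penalized for it. Once something forces it onto the other path, that shortcut is no better than chance, which is the kind of out-of-trajectory failure the abstract describes.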
