Information-Theoretic State Variable Selection for Reinforcement Learning
Charles Westphal, Stephen Hailes, Mirco Musolesi
TL;DR
This work addresses the challenge of learning compact, informative state representations in reinforcement learning by introducing the Transfer Entropy Redundancy Criterion (TERC), an information-theoretic measure that quantifies how much state variables reduce the uncertainty in actions. The core idea is to include only variables that exhibit positive transfer entropy to actions, while rigorously handling perfect conditional redundancy (PCR/PMCR/CPMCR) to avoid discarding informative variables or retaining redundant ones. The authors provide theoretical guarantees and practical algorithms for deriving the minimal informative state subset: a Naïve method, a CPMCR-aware algorithm, and a simplified variant that scales linearly with the number of variables. Extensive experiments on synthetic data and diverse RL benchmarks (Cart Pole, Lunar Lander, Pendulum, Secret Key Game, Iterated Prisoner’s Dilemma) show that TERC consistently identifies the optimal variable set and accelerates learning compared to the UMFI and PI baselines, while also enabling interpretable tracking of information transfer during training.
Abstract
Identifying the most suitable variables to represent the state is a fundamental challenge in Reinforcement Learning (RL). These variables must efficiently capture the information necessary for making optimal decisions. To address this problem, we introduce the Transfer Entropy Redundancy Criterion (TERC), an information-theoretic criterion that determines whether entropy is transferred from state variables to actions during training. We define an algorithm based on TERC that provably excludes variables from the state that have no effect on the final performance of the agent, resulting in more sample-efficient learning. Experimental results show that this speed-up is present across three different algorithm classes (represented by tabular Q-learning, Actor-Critic, and Proximal Policy Optimization (PPO)) in a variety of environments. Furthermore, to highlight the differences between the proposed methodology and current state-of-the-art feature selection approaches, we present a series of controlled experiments on synthetic data, before generalizing to real-world decision-making tasks. We also introduce a representation of the problem that compactly captures the transfer of information from state variables to actions as Bayesian networks.
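To make the selection idea concrete, the following is a minimal sketch, not the authors' algorithm. It assumes discrete data and plug-in entropy estimates, and it scores a variable by how much the conditional entropy of the action rises when that variable is removed from the current set. That score is an illustrative stand-in for the criterion, whose exact definition is not given in this section. The names `entropy`, `cond_entropy`, `select_variables`, and the tolerance `tol` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(rows):
    """Plug-in Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def cond_entropy(actions, states, keep):
    """Estimate H(A | X_keep) as H(A, X_keep) - H(X_keep)."""
    ctx = [tuple(s[i] for i in keep) for s in states]
    return entropy(list(zip(actions, ctx))) - entropy(ctx)


def select_variables(actions, states, tol=1e-2):
    """One backward pass over the variables (assumed stand-in criterion).

    Variable i is dropped if removing it from the CURRENT set does not
    raise the uncertainty about the action by more than `tol` bits.
    """
    keep = list(range(len(states[0])))
    for i in list(keep):
        rest = [j for j in keep if j != i]
        gain = cond_entropy(actions, states, rest) - cond_entropy(actions, states, keep)
        if gain <= tol:
            keep = rest
    return keep


# Toy data: x0 and x1 are bits, x2 is an exact copy of x0, x3 is noise.
# The action is x0 XOR x1, so the noise variable carries no information.
states = [(a, b, a, n) for a, b, n in product([0, 1], repeat=3)]
actions = [a ^ b for a, b, n in product([0, 1], repeat=3)]

selected = select_variables(actions, states)
print(selected)  # [1, 2]
```

In this toy case x0 and x2 are perfectly redundant copies, so testing each variable against the full set independently would discard both and lose the action entirely. Removing variables one at a time and re-evaluating against the reduced set keeps exactly one copy. This is only one simple way to sidestep that failure mode and makes a single pass over the variables; it should not be read as the CPMCR-aware procedure described in the paper.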
