Reinforcement Learning with Exogenous States and Rewards
George Trimponias, Thomas G. Dietterich
TL;DR
This work addresses the slow learning that exogenous variability causes in reinforcement learning by introducing a formal exogenous–endogenous decomposition of MDPs whose reward is additive. It shows that the Bellman equation decomposes accordingly: the endogenous MDP, paired with a regression that estimates the exogenous reward, is sufficient, because any policy that is optimal for the endogenous MDP is also optimal for the original MDP. The authors formalize this decomposition, prove that the maximal exogenous subspace exists and is unique, and develop two practical algorithms, GRDS and SRAS, that discover the exogenous subspace when the exogenous and endogenous variables are mixed linearly; they also propose CCC as a surrogate for conditional mutual information. Empirical results across high-dimensional linear, nonlinear, discrete, and combinatorial-action MDPs demonstrate substantial speedups and, in many cases, end-to-end performance that matches or exceeds an endogenous-reward oracle, with Simplified-GRDS often the most robust option. The work advances RL in environments with uncontrollable exogenous factors by providing a principled, online method to isolate and remove exogenous reward noise, which accelerates learning in complex control tasks.
Abstract
Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into an exogenous Markov Reward Process (based on the exogenous reward) and an endogenous Markov Decision Process (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous state spaces is not given but must be discovered. The paper introduces and proves correctness of algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination. These algorithms can be applied during reinforcement learning to discover the exogenous subspace, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.
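The variance-reduction idea in the abstract can be illustrated with a small numerical sketch. It is not the paper's algorithm: it assumes the exogenous state variables are already known (the paper's methods discover them) and that the exogenous reward is linear in them. The function name `remove_exogenous_reward` and all constants are hypothetical choices for illustration.

```python
import numpy as np

def remove_exogenous_reward(exo_states, rewards):
    """Regress rewards on exogenous features and return the residual reward.

    Assumption for this sketch: a linear model with an intercept is enough
    to capture the exogenous reward component.
    """
    # Design matrix: exogenous features plus an intercept column.
    X = np.column_stack([exo_states, np.ones(len(exo_states))])
    coef, *_ = np.linalg.lstsq(X, rewards, rcond=None)
    # Whatever the exogenous state cannot explain is kept as the
    # estimate of the endogenous reward (up to a constant shift).
    return rewards - X @ coef

rng = np.random.default_rng(0)
n = 5000
exo = rng.normal(size=(n, 3))   # exogenous state: unaffected by actions
end = rng.normal(size=(n, 1))   # stand-in for the endogenous state

r_exo = exo @ np.array([4.0, -3.0, 2.0])  # large uncontrolled reward term
r_end = 0.5 * end[:, 0]                   # small controllable reward term
reward = r_exo + r_end                    # additive decomposition

residual = remove_exogenous_reward(exo, reward)
raw_var = reward.var()
res_var = residual.var()
print(f"variance of raw reward:      {raw_var:.2f}")
print(f"variance of residual reward: {res_var:.2f}")
```

In this toy setting the residual tracks the endogenous term closely and has far lower variance than the raw reward, which is the property the abstract credits for making the endogenous MDP easier to solve. The constant shift introduced by the intercept does not change which policy is optimal.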
