Confounding Robust Continuous Control via Automatic Reward Shaping
Mateo Juliani, Mingxuan Li, Elias Bareinboim
TL;DR
The paper addresses the problem of learning effective reward shaping for continuous control when offline data is confounded by unobserved variables. It introduces a causal upper bound on the optimal state value function, learned from confounded offline data via a Causal Bellman Equation, and uses this bound as the potential in Potential-Based Reward Shaping (PBRS) to guide online SAC training. The method is validated on six confounded continuous-control tasks (MuJoCo and Adroit), showing robust improvements over unshaped SAC and causally unaware shaping baselines, with analyses of data quality and confounding strength. The work provides theoretical guarantees (convergence and policy invariance of PBRS in CMDPs) and demonstrates a practical, scalable approach to confounding-robust reinforcement learning with causal reward design. It is a step toward robust continuous control under real-world confounding and offers a foundation for extensions to broader confounding settings and higher-dimensional observations.
Abstract
Reward shaping has been applied widely to accelerate the training of Reinforcement Learning (RL) agents. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under-explored. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potentials in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding-robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at https://github.com/mateojuliani/confounding_robust_cont_control.
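
To make the PBRS step concrete, here is a minimal sketch of the standard potential-based shaping rule, r' = r + γΦ(s') − Φ(s), which is the general form the abstract refers to. It is not taken from the authors' repository. The names `shaped_reward` and `toy_potential` are hypothetical, the toy potential is only a stand-in for the learned causal upper bound on the optimal state values, and treating the potential of a terminal state as zero is a common convention assumed here rather than something stated in the source.

```python
from typing import Callable, Sequence

State = Sequence[float]


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    potential: Callable[[State], float],
    gamma: float = 0.99,
    terminal: bool = False,
) -> float:
    """Standard PBRS rule: r' = r + gamma * Phi(s') - Phi(s).

    `potential` plays the role of Phi. In the paper it would be the learned
    upper bound on the optimal state values; here it is any callable.
    """
    # Assumption: the potential of a terminal next state is taken to be zero.
    next_potential = 0.0 if terminal else potential(next_state)
    return reward + gamma * next_potential - potential(state)


def toy_potential(state: State) -> float:
    """Hypothetical stand-in potential: negative squared distance to the origin."""
    return -sum(x * x for x in state)


if __name__ == "__main__":
    # Moving from [1, 0] to the origin raises the potential, so the shaped
    # reward is larger than the environment reward of 1.0.
    print(shaped_reward(1.0, [1.0, 0.0], [0.0, 0.0], toy_potential, gamma=0.5))
```

In an online training loop, a function like this would be applied to each transition before it is handed to the SAC update, so the agent optimizes the shaped reward while the environment reward itself is left unchanged.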
