Confounding Robust Continuous Control via Automatic Reward Shaping

Mateo Juliani; Mingxuan Li; Elias Bareinboim

TL;DR

The paper addresses the problem of learning effective reward shaping for continuous control when offline data is confounded by unobserved variables. It introduces a causal upper-bounded state value function learned from confounded offline data via a Causal Bellman Equation and uses this as a potential in Potential-Based Reward Shaping to guide online SAC training. The method is validated across six confounded continuous-control tasks (MuJoCo and Adroit), showing robust improvements over unshaped SAC and causally unaware shaping baselines, with analyses on data quality and confounding strength. The work provides theoretical guarantees (convergence and policy-invariance of PBRS in CMDPs) and demonstrates a practical, scalable approach to confounding-robust reinforcement learning with causal reward design. This represents a meaningful step toward robust continuous control under real-world confounding conditions and offers a foundation for further extensions to broader confounding settings and higher-dimensional observations.

Abstract

Reward shaping has been applied widely to accelerate the training of Reinforcement Learning (RL) agents. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under-explored. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets that are potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potential in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding-robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at https://github.com/mateojuliani/confounding_robust_cont_control.
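
As a rough illustration of how a state potential enters PBRS, the sketch below applies the standard shaping rule r' = r + γΦ(s') − Φ(s) to a single transition. It is a minimal sketch, not the authors' implementation: `shaped_reward` and `toy_potential` are hypothetical names, and `toy_potential` is a hand-written stand-in for the learned upper bound on the optimal state values, which in the paper is obtained from confounded offline data.

```python
from typing import Callable, Sequence

State = Sequence[float]


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    potential: Callable[[State], float],
    gamma: float = 0.99,
) -> float:
    """Standard PBRS rule: r' = r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


def toy_potential(state: State) -> float:
    """Hypothetical potential (assumption): negative squared distance to the origin.

    In the paper the potential is a learned value upper bound; this toy
    function only makes the example self-contained.
    """
    return -sum(x * x for x in state)


# One transition that moves the agent from (1, 0) to the origin.
r_shaped = shaped_reward(1.0, [1.0, 0.0], [0.0, 0.0], toy_potential, gamma=0.5)
print(r_shaped)
```

In an online loop, the shaped reward would replace the environment reward stored in the replay buffer, while the environment dynamics and the agent's update rule stay unchanged.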

Paper Structure (41 sections, 3 theorems, 15 equations, 11 figures, 7 tables, 1 algorithm)

Introduction
Background
Confounding Robust Decision-making
Continuous Control with Deep Reinforcement Learning
Potential-based reward shaping (PBRS)
Notations
The Challenge of Confounded Continuous Control
Confounded Continuous Control with Shaped Rewards
Learn Optimistic State Potentials via Confounding Robust Offline Pretraining
Online Fine-tuning with Reward Shaping
Experiments
Experiment Design
Causal Reward Shaping Performance
Hopper
HalfCheetah
...and 26 more sections

Key Result

Theorem 4.1

For a CMDP environment $\mathcal{M}$ with rewards bounded above, $Y_h \leq b$ for some $b\in \mathbb{R}$, the optimal value of interventional policies satisfies $V^*(\boldsymbol{s}) \leq \overline{V}(\boldsymbol{s})$ for all $\boldsymbol{s}\in \mathcal{S}$, where $\overline{V}$ satisfies the Causal Bellman Optimality Equation, $\widetilde{\mathcal{R}}$ is the reward distribution estimated from offline data, and $\widetilde{\mathcal{T}}$ is the ...

Figures (11)

Figure 1: (a) CMDP causal diagram of the offline data-generating process; (b) CMDP causal diagram under policy $\operatorname{do}({\pi})$ during the online learning process. Compared with the standard MDP, the highlighted bi-directed dashed arrows represent confounders that affect the behavioral policy, state transitions, and rewards while being unobservable to the online agents.
Figure 2: Performance of Hopper SAC Agent with full capacity and SAC agent unable to observe state 2.
Figure 3: Normalized IQM returns relative to the State Removed Baseline SAC agent in confounded continuous control benchmarks.
Figure 4: RCIT test statistic vs. Causal PBRS improvements.
Figure 5: Causal PBRS performance by offline data quality.
...and 6 more figures

Theorems & Definitions (4)

Definition 3.1
Theorem 4.1: Causal Bellman Optimal Equation for Stationary Infinite-Horizon CMDPs
Theorem 4.2: Convergence of Causal Bellman Optimal Equation
Proposition 4.3
