Learning Causally Invariant Reward Functions from Diverse Demonstrations

Ivan Ovinnikov; Eugene Bykovets; Joachim M. Buhmann

TL;DR

This work explores a novel regularization approach for inverse reinforcement learning methods, based on the causal invariance principle, with the goal of improving reward function generalization. Policies trained on the recovered reward functions show superior performance in a transfer setting.

Abstract

Inverse reinforcement learning methods aim to retrieve the reward function of a Markov decision process based on a dataset of expert demonstrations. The commonplace scarcity and heterogeneous sources of such demonstrations can lead to the absorption of spurious correlations in the data by the learned reward function. Consequently, this adaptation often exhibits behavioural overfitting to the expert dataset when a policy is trained on the obtained reward function under distribution shift of the environment dynamics. In this work, we explore a novel regularization approach for inverse reinforcement learning methods based on the causal invariance principle with the goal of improved reward function generalization. By applying this regularization to both exact and approximate formulations of the learning task, we demonstrate superior performance of policies trained on the recovered reward functions in a transfer setting.
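To make the idea of an invariance regularizer concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear reward $r = \theta^\top \varphi$, a small enumerable set of trajectories with $p_\theta(\xi) \propto \exp(\theta^\top \varphi(\xi))$, and an IRMv1-style penalty (squared gradient of each environment's loss with respect to a dummy scale fixed at 1). The exact penalty used in the paper is not stated in this summary, and the function names `model_feature_expectation` and `invariance_terms` are hypothetical.

```python
import numpy as np

def model_feature_expectation(theta, traj_features):
    """E_{p_theta}[phi] for p_theta(xi) proportional to exp(theta . phi(xi)).

    traj_features has shape (num_trajectories, d); the trajectory set is
    assumed to be small enough to enumerate (an illustrative assumption).
    """
    logits = traj_features @ theta
    logits = logits - logits.max()      # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return probs @ traj_features

def invariance_terms(theta, traj_features, expert_means):
    """Return the pooled (ERM) gradient and an IRMv1-style penalty.

    expert_means has shape (num_envs, d): empirical feature expectations of
    the expert demonstrations in each training environment e.
    """
    mu_model = model_feature_expectation(theta, traj_features)
    # Gradient of the per-environment negative log-likelihood w.r.t. theta
    # in the MaxEnt feature-matching setting: mu_model - mu_expert^e.
    env_grads = mu_model[None, :] - expert_means
    # Derivative of R^e(w * theta) w.r.t. the dummy scale w at w = 1
    # equals theta . grad_theta R^e(theta).
    scale_grads = env_grads @ theta
    penalty = float(np.sum(scale_grads ** 2))
    erm_grad = env_grads.mean(axis=0)
    return erm_grad, penalty

# Two trajectories with one-hot features, two environments whose experts
# each prefer a different trajectory.
features = np.array([[1.0, 0.0], [0.0, 1.0]])
experts = np.array([[1.0, 0.0], [0.0, 1.0]])
grad, pen = invariance_terms(np.array([1.0, 0.0]), features, experts)
```

In a training loop under these assumptions, the penalty would be scaled by a coefficient (written $\lambda_I$ in the figure captions) and added to the pooled loss, so that reward parameters fitting one environment at the expense of another are discouraged.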

Paper Structure (25 sections, 4 theorems, 31 equations, 10 figures, 4 tables, 2 algorithms)

Introduction
Method
Problem setting
Variational dual formulation.
Spurious correlations and causal invariance approaches
Transition SCM.
Dataset partitioning and training settings.
From robustness to invariance.
Reward regularization using causal invariance
Experiments
Tractable setting: Gridworld experiments using feature matching
Adversarial setting
Related work
Conclusion
Limitations and future work.
...and 10 more sections

Key Result

Proposition 1

Let the likelihood $p(\xi)$ belong to a natural exponential family with parameter $\psi$, sufficient statistics $\varphi(x)$ and the (Lebesgue) base measure $p_0$. Let ${\mathcal{D}}_E^e$ be the dataset corresponding to interventional setting $e$. Then, for all $e\in{\mathcal{E}}_{tr}$, the causal i

Figures (10)

Figure 1: (a) Probabilistic graphical model of a transition under the influence of the index variable $E$ and latent variable $C$. The invariant conditional is highlighted in blue. (b) General setting where $O_t$ depends on causal $\mathbf{x}^{(c)}$ and non-causal $\mathbf{x}^{(nc)}$ features of the transition. Conditioning on the collider variable (in orange) creates a spurious correlation path. (c) Collider conditioning assuming wrong edge orientation $O_t \to s_{t+1}$. This corresponds to $O_t$ being the causal parent of $s_{t+1}$. (d) Spurious correlations arising under the assumption of a state-only formulation of the reward. Since $a_t$ is unobserved, a backdoor path (in red) is formed.
Figure 2: Feature matching reward recovery on a gridworld environment. (a) Expert trajectory datasets: 1st group (blue): 400 trajectories, 2nd group (white): 25 trajectories, 3rd group (green): 3 trajectories. (b) MaxEnt IRL ERM baseline. (c) MaxEnt IRL ERM baseline with L2 regularization coefficient $\lambda_{L2} = 1e-3$. (d) MaxEnt IRL with CI penalty, $\lambda_{I}=0.01$. (e) MaxEnt IRL with CI penalty, $\lambda_{I}=0.05$.
Figure 3: Comparison of SAC policy performance w.r.t. ground truth reward when trained on inferred reward functions. Every row depicts a different type of dynamics perturbation for the five MuJoCo tasks as described in \ref{sec:exp_dual}. Here, AIRL is chosen as the baseline algorithm. The variants correspond to the unregularized baseline: erm, Lipschitz regularization: lip, and the three best CI regularization parameters: ci.
Figure 4: Comparison of SAC policy performance w.r.t. ground truth reward when trained on recovered reward functions as a function of perturbation magnitude of the body mass parameter. Here, AIRL is chosen as the baseline algorithm. erm denotes the unregularized baseline, lip the best Lipschitz regularization hyperparameters per environment, and ci the best causal invariance regularization hyperparameters.
Figure 5: Feature matching reward recovery on a gridworld environment. (a) Expert trajectory datasets: every color represents a modality containing 50 trajectories. (b) MaxEnt IRL ERM baseline. (c) MaxEnt IRL ERM baseline with L2 regularization coefficient $\lambda_{L2} = 1e-3$. (d) MaxEnt IRL baseline with spectral norm (Lipschitz) regularization. (e) MaxEnt IRL with CI penalty, $\lambda_{I}=0.1$. (f) MaxEnt IRL with CI penalty, $\lambda_{I}=0.5$.
...and 5 more figures

Theorems & Definitions (7)

Definition 1
Proposition 1
Proposition 2
Proposition 2
proof
Proposition 2
proof
