Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification

Joar Skalse; Alessandro Abate

TL;DR

This paper examines how robust inverse reinforcement learning (IRL) is to misspecification of the behavioural model that maps rewards to policies. It sets out a formal misspecification framework and uses the STARC metric to quantify how far apart two reward functions are in terms of the policy orderings they induce, which allows necessary and sufficient conditions for robustness to be stated. The authors show that, under mild misspecification, many common IRL models (optimality, Boltzmann-rationality, maximal causal entropy (MCE)) can produce large errors in the inferred reward, and that misspecifying the discount rate or the environment dynamics largely destroys robustness, while misspecification of certain other parameters (e.g., $\beta$ for Boltzmann or $\alpha$ for MCE) is tolerable. The results highlight fundamental limitations of IRL in real-world preference elicitation and motivate designing more robust IRL methods or integrating additional data sources to mitigate misspecification risks.

Abstract

Inverse reinforcement learning (IRL) aims to infer an agent's preferences (represented as a reward function $R$) from their behaviour (represented as a policy $\pi$). To do this, we need a behavioural model of how $\pi$ relates to $R$. In the current literature, the most common behavioural models are optimality, Boltzmann-rationality, and causal entropy maximisation. However, the true relationship between a human's preferences and their behaviour is much more complex than any of these behavioural models. This means that the behavioural models are misspecified, which raises the concern that they may lead to systematic errors if applied to real data. In this paper, we analyse how sensitive the IRL problem is to misspecification of the behavioural model. Specifically, we provide necessary and sufficient conditions that completely characterise how the observed data may differ from the assumed behavioural model without incurring an error above a given threshold. We also characterise the conditions under which a behavioural model is robust to small perturbations of the observed policy, and we analyse how robust many behavioural models are to misspecification of their parameter values (such as the discount rate). Our analysis suggests that the IRL problem is highly sensitive to misspecification, in the sense that very mild misspecification can lead to very large errors in the inferred reward function.

Paper Structure (26 sections, 24 theorems, 10 equations, 6 figures)

Introduction
Related Work
Preliminaries
Theoretical Framework
Defining Misspecification Robustness
Reward Function Metrics
Background Results
Misspecification Robustness
Necessary and Sufficient Conditions
Perturbation Robustness
Misspecified Parameters
Discussion
Motivating Our Definition of Misspecification Robustness
Additional Comments On the Conditions For Misspecification Robustness
On the Assumption That Behavioural Models Are Functions
...and 11 more sections

Key Result

Proposition 1

$(\mathcal{S}, \mathcal{A}, \tau, \mu_0, R_1, \gamma)$ and $(\mathcal{S}, \mathcal{A}, \tau, \mu_0, R_2, \gamma)$ have the same ordering of policies if and only if $R_1$ and $R_2$ differ by potential shaping (with $\gamma$), $S'$-redistribution (with $\tau$), and positive linear scaling.
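The "if" direction of Proposition 1 can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: it assumes the usual forms of the three transformations, namely potential shaping $R_2(s,a,s') = R_1(s,a,s') + \gamma\Phi(s') - \Phi(s)$, an $S'$-redistribution that adds a term with zero expectation under $\tau(s,a)$, and scaling by a constant $c > 0$. The random MDP, the helper names `policy_value` and `transform`, and the restriction to deterministic stationary policies are all assumptions made for the illustration.

```python
import itertools
import random


def policy_value(policy, tau, mu0, R, gamma, iters=2000):
    """Expected discounted return J(policy), by iterative policy evaluation."""
    n = len(mu0)
    V = [0.0] * n
    for _ in range(iters):
        # Synchronous Bellman backup for the fixed deterministic policy.
        V = [sum(tau[s][policy[s]][t] * (R[s][policy[s]][t] + gamma * V[t])
                 for t in range(n))
             for s in range(n)]
    return sum(mu0[s] * V[s] for s in range(n))


def transform(R, tau, gamma, phi, eps, c):
    """Apply potential shaping (phi), an S'-redistribution (eps), and scaling (c > 0)."""
    n, m = len(R), len(R[0])
    R2 = [[[0.0] * n for _ in range(m)] for _ in range(n)]
    for s in range(n):
        for a in range(m):
            # Centre eps so that it has zero mean under tau(s, a).
            mean_eps = sum(tau[s][a][t] * eps[s][a][t] for t in range(n))
            for t in range(n):
                shaped = R[s][a][t] + gamma * phi[t] - phi[s]
                R2[s][a][t] = c * (shaped + eps[s][a][t] - mean_eps)
    return R2


rng = random.Random(0)
n_states, n_actions, gamma, c = 3, 2, 0.9, 2.5


def random_dist(k):
    """A random probability vector with strictly positive entries."""
    w = [rng.random() + 0.1 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]


def random_tensor():
    """Random values in [-1, 1], indexed as [s][a][s']."""
    return [[[rng.uniform(-1, 1) for _ in range(n_states)]
             for _ in range(n_actions)] for _ in range(n_states)]


tau = [[random_dist(n_states) for _ in range(n_actions)] for _ in range(n_states)]
mu0 = random_dist(n_states)
R1 = random_tensor()
phi = [rng.uniform(-1, 1) for _ in range(n_states)]
R2 = transform(R1, tau, gamma, phi, random_tensor(), c)

# Enumerate every deterministic stationary policy (one action per state).
policies = list(itertools.product(range(n_actions), repeat=n_states))
J1 = [policy_value(p, tau, mu0, R1, gamma) for p in policies]
J2 = [policy_value(p, tau, mu0, R2, gamma) for p in policies]

# Under these assumptions J2 = c * (J1 - E_{mu0}[phi]), so the ranking is unchanged.
offset = sum(mu0[s] * phi[s] for s in range(n_states))
order1 = sorted(range(len(policies)), key=lambda i: J1[i])
order2 = sorted(range(len(policies)), key=lambda i: J2[i])
same_order = order1 == order2
print("same policy ordering:", same_order)
```

The check relies on the fact that, for these transformations, every policy's value is shifted by the same constant and scaled by the same positive factor. It does not demonstrate the "only if" direction, and it does not cover stochastic policies.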

Figures (6)

Theorems & Definitions (42)

Definition 1
Definition 2
Proposition 1
Proposition 2
Theorem 1
Proposition 3
Corollary 1
Proposition 4
Definition 3
Definition 4
...and 32 more
