Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch

Malek Mechergui; Sarath Sreedharan

TL;DR

This work uses theory of mind, i.e., the human user's beliefs about the AI agent, as the basis for a formal explanatory framework called Expectation Alignment (EAL). The framework provides concrete insights into the limitations of existing methods for handling reward misspecification and points to novel solution strategies.

Abstract

Detecting and handling misspecified objectives, such as reward functions, has been widely recognized as one of the central challenges within the domain of Artificial Intelligence (AI) safety research. However, despite the recognized importance of this problem, we are unaware of any works that attempt to provide a clear definition of what constitutes (a) a misspecified objective and (b) a successful resolution of such a misspecification. In this work, we use theory of mind, i.e., the human user's beliefs about the AI agent, as a basis to develop a formal explanatory framework called Expectation Alignment (EAL) to understand objective misspecification and its causes. Our EAL framework not only explains existing works but also provides concrete insights into the limitations of existing methods for handling reward misspecification and suggests novel solution strategies. We use these insights to propose a new interactive algorithm that uses the specified reward to infer potential user expectations about the system behavior. We show how one can efficiently implement this algorithm by mapping the inference problem into linear programs. We evaluate our method on a set of standard Markov Decision Process (MDP) benchmarks.
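
The abstract refers to mapping the inference problem into linear programs, but the paper's own programs are not reproduced in this summary. As background only, the following is the textbook occupancy-measure linear program for a discounted MDP, not the authors' formulation. The symbols are assumptions introduced here for illustration: $x(s,a)$ is the discounted state-action occupancy, $T(s,a,s')$ the transition probability, $\mathcal{R}(s,a)$ the specified reward, $\mu_0$ the initial state distribution, and $\gamma \in [0,1)$ the discount factor.

$$
\begin{aligned}
\max_{x \ge 0} \quad & \sum_{s,a} x(s,a)\,\mathcal{R}(s,a) \\
\text{s.t.} \quad & \sum_{a} x(s',a) \;-\; \gamma \sum_{s,a} T(s,a,s')\,x(s,a) \;=\; \mu_0(s') \qquad \text{for all } s'
\end{aligned}
$$

An optimal policy can be read off any optimal solution as $\pi(a \mid s) = x(s,a) / \sum_{a'} x(s,a')$ for states with nonzero occupancy. Because both the objective and the constraints are linear in $x$, questions about which behaviors a given reward makes optimal can, in principle, be posed as linear programs of this kind.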

Paper Structure (11 sections, 11 theorems, 8 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Background
Expectation Alignment Framework
Identifying Expectation-Aligned Policies
Related Works
Evaluation
Conclusion
Appendix / supplemental material
Proof Sketch for Propositions
Noisy Rational Model
Broader Impact

Key Result

Theorem 1

There may be human and robot domains, $\mathcal{D}^H$ and $\mathcal{D}^R$, and an expectation set $\mathbb{E}^H$, such that one can never come up with a human-sufficient reward function $\mathcal{R}$ that is not misspecified with respect to $\mathcal{D}^R$, even if one allows the human planning func

Figures (1)

Figure 1: A diagrammatic overview of how specifying a reward function plays a role in whether or not the user's expectations are met.

Theorems & Definitions (23)

Definition 1
Definition 2
Definition 3
Definition 4
Definition 5
Theorem 1
Proof Sketch
Definition 6
Proposition 1
Proposition 2
...and 13 more
