Contracting With a Reinforcement Learning Agent by Playing Trick or Treat

Matteo Bollini, Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

TL;DR

We design an efficient algorithm to compute an optimal, approximately incentive-compatible policy for principal-agent problems in MDPs, together with an efficient method that makes any approximately-incentive-compatible policy incentive compatible. The method generalizes a related approach already known for classical principal-agent problems.

Abstract

We study principal-agent problems where a farsighted agent takes costly actions in an MDP. The core challenge in these settings is that the agent's actions are hidden from the principal, who can only observe their outcomes, namely state transitions and their associated rewards. Thus, the principal's goal is to devise a policy that incentivizes the agent to take actions leading to desirable outcomes. This is accomplished by committing to a payment scheme (a.k.a. contract) at each step, specifying a monetary transfer from the principal to the agent for every possible outcome. Interestingly, we show that Markovian policies are unfit in these settings, as they do not achieve the optimal principal's utility and are computationally intractable. Thus, accounting for history is unavoidable, and this begets considerable additional challenges compared to standard MDPs. Nevertheless, we design an efficient algorithm to compute an optimal policy, leveraging a compact way of representing histories for this purpose. Unfortunately, the policy produced by such an algorithm cannot be readily implemented, as it is only approximately incentive compatible, meaning that the agent is incentivized to take the desired actions only approximately. To fix this, we design an efficient method to make such a policy incentive compatible, introducing only a negligible loss in the principal's utility. This method can be applied to any approximately-incentive-compatible policy, and it generalizes a related approach already known for classical principal-agent problems to more general settings in MDPs.
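To make the notions of contract and approximate incentive compatibility concrete, here is a minimal sketch of the classical single-step hidden-action problem that the abstract says the paper generalizes. It is not the paper's algorithm and it ignores the MDP and history dependence entirely. The function names, the toy numbers, and the one-step reading of "approximately incentive compatible" (the recommended action is within `eps` of the agent's best utility) are assumptions made for illustration.

```python
from typing import Sequence


def agent_utility(contract: Sequence[float], probs: Sequence[float], cost: float) -> float:
    """Expected payment over outcomes minus the cost of the chosen action."""
    return sum(p * t for p, t in zip(probs, contract)) - cost


def best_response(contract, outcome_probs, costs) -> int:
    """Index of the action maximizing the agent's expected utility."""
    utils = [agent_utility(contract, p, c) for p, c in zip(outcome_probs, costs)]
    return max(range(len(utils)), key=utils.__getitem__)


def is_eps_ic(contract, outcome_probs, costs, recommended: int, eps: float = 0.0) -> bool:
    """One-step analogue (assumed) of eps-IC: the recommended action
    loses at most eps with respect to the agent's best deviation."""
    utils = [agent_utility(contract, p, c) for p, c in zip(outcome_probs, costs)]
    return utils[recommended] >= max(utils) - eps


def principal_utility(contract, probs, rewards) -> float:
    """Expected reward minus expected payment under the induced outcome distribution."""
    return sum(p * (r - t) for p, r, t in zip(probs, rewards, contract))


# Hypothetical toy instance: 2 actions, 2 outcomes (failure, success).
outcome_probs = [[0.9, 0.1],   # action 0: cheap, rarely succeeds
                 [0.2, 0.8]]   # action 1: costly, usually succeeds
costs = [0.0, 0.3]
rewards = [0.0, 1.0]           # principal's reward per outcome

generous = [0.0, 0.5]          # pays 0.5 on success
stingy = [0.0, 0.4]            # pays 0.4 on success

print(best_response(generous, outcome_probs, costs))  # action induced by the generous contract
print(best_response(stingy, outcome_probs, costs))    # action induced by the stingy contract
print(is_eps_ic(stingy, outcome_probs, costs, recommended=1, eps=0.05))
```

In this toy instance the stingy contract recommends the costly action but only approximately incentivizes it: the agent gains slightly by deviating to the cheap action. The abstract's last step addresses exactly this gap, turning an approximately incentive-compatible policy into an exactly incentive-compatible one at a small cost to the principal.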

Paper Structure

This paper contains 36 sections, 32 theorems, 181 equations, 3 figures, 8 algorithms.

Key Result

Proposition 3.0

There exists a problem instance in which a history-dependent policy yields a larger cumulative expected utility for the principal than any Markovian policy.

Figures (3)

  • Figure 1: An instance where no Markovian policy is optimal.
  • Figure 2: Illustrative instance of a cubic graph with $V=\{v_1,v_2,v_3\}$ and $E=\{e_1,e_2\}$.
  • Figure 3: Instance of the Contract MDP problem constructed from the cubic graph depicted in Figure 2. Costs and rewards are reported only when different from zero. The states $s_{v_1}$, $s_{v_2}$ and $s_{v_3}$ share the same set of actions $\{a_1,a_2,a_3\}$.

Theorems & Definitions (60)

  • Definition 2.1: $\epsilon$-IC policies
  • Definition 2.2: Direct policies
  • Proposition 3.0
  • Theorem 3.1
  • Lemma 4.0
  • Definition 4.1: $\eta$-honesty
  • Lemma 4.1
  • Lemma 4.1
  • Theorem 4.2
  • Definition 5.1: Agent's function
  • ...and 50 more