Principal-Agent Reward Shaping in MDPs

Omer Ben-Porat; Yishay Mansour; Michal Moshkovitz; Boaz Taitler

TL;DR

This work studies reward shaping in principal-agent settings over Markov decision processes (MDPs). It formulates PARS-MDP, in which a Principal allocates a nonnegative bonus reward $R^B$, subject to a total budget $B$, to influence the Agent's policy. The Agent best-responds to $R^A+R^B$, and the Principal aims to maximize its own utility $V^P$ under the induced policy; the general problem is NP-hard. The authors develop two approximation schemes: STAR for stochastic trees, which achieves a fully polynomial-time approximation with budget inflation and an $O(|A||S|k(B/\varepsilon)^3)$ runtime, and DFAR for finite-horizon deterministic decision processes (DDPs), which yields optimal results under $\varepsilon$-discretization and a bi-criteria bound otherwise. Simulations on generated layered MDPs are consistent with the theory: the Principal's utility approaches the optimum as the discretization tightens and the budget increases. Overall, the paper provides provable guarantees for reward-shaping strategies in sequential principal-agent settings, with applications to environment design and incentive mechanisms in complex decision processes.

Abstract

Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon.
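The Stackelberg interaction described above can be illustrated with a minimal brute-force sketch on a toy instance. Everything here is an assumption made for illustration and is not taken from the paper: the six-edge deterministic tree and its rewards, the names `EDGES`, `PATHS`, `best_response`, and `solve`, the reading of the budget as a cap on the sum of all offered bonuses, and the tie-breaking rule in the principal's favor. The exhaustive search is not STAR or DFAR; it only shows the agent's best response to $R^A+R^B$ and the principal's budgeted choice of $R^B$ on a bonus grid of step `eps`.

```python
from itertools import product

# Hypothetical toy instance: a deterministic binary tree of depth 2.
# Each edge maps to (agent reward R^A, principal reward R^P).
EDGES = {
    ("s0", "l"): (1.0, 0.0),
    ("s0", "r"): (0.0, 0.0),
    ("l", "ll"): (1.0, 0.0),
    ("l", "lr"): (0.0, 1.0),
    ("r", "rl"): (0.0, 3.0),
    ("r", "rr"): (1.0, 0.0),
}

# With deterministic transitions, a policy is summarized by its root-to-leaf path.
PATHS = [
    [("s0", "l"), ("l", "ll")],
    [("s0", "l"), ("l", "lr")],
    [("s0", "r"), ("r", "rl")],
    [("s0", "r"), ("r", "rr")],
]


def principal_value(path):
    """Principal's utility along a path (sum of R^P)."""
    return sum(EDGES[e][1] for e in path)


def best_response(bonus):
    """Agent maximizes the sum of R^A + R^B along the path.

    Assumption: ties are broken in the principal's favor.
    """
    def key(path):
        agent_total = sum(EDGES[e][0] + bonus[e] for e in path)
        return (agent_total, principal_value(path))
    return max(PATHS, key=key)


def solve(budget, eps):
    """Exhaustive search over bonus functions on a grid of step eps.

    Assumption: the budget caps the sum of bonuses over all edges.
    """
    units = int(round(budget / eps))
    best_value, best_bonus = float("-inf"), None
    for alloc in product(range(units + 1), repeat=len(EDGES)):
        if sum(alloc) > units:
            continue  # allocation exceeds the budget
        bonus = {e: u * eps for e, u in zip(EDGES, alloc)}
        value = principal_value(best_response(bonus))
        if value > best_value:
            best_value, best_bonus = value, bonus
    return best_value, best_bonus


if __name__ == "__main__":
    for b in (0, 1, 2):
        print(b, solve(b, eps=1.0)[0])
```

In this toy instance the principal's utility is 0, 1, and 3 for budgets 0, 1, and 2: the utility jumps only once the bonus is large enough to make a more favorable path a best response for the agent. The search is exponential in the number of edges, so it is usable only on instances of this size.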

Paper Structure (38 sections, 24 theorems, 50 equations, 3 figures, 4 algorithms)

Introduction
Our Contribution
Related Work
Model
PARS-MDP as an Optimization Problem
Warmup Examples
Implementable Policies
Stochastic Trees
The STAR Algorithm
Deterministic Decision Processes with Finite Horizon
The DFAR Algorithm
Cyclic Deterministic Decision Processes
Simulations
Generating MDPs
Algorithms
...and 23 more sections

Key Result

Theorem 1

Let the underlying MDP be a $k$-ary tree of depth $H$, and let $V^P_\star$ be the optimal utility of Principal's problem with budget $B$. Given any small positive constant $\alpha$, our algorithm for stochastic trees (STAR) guarantees a utility of at least $V^P_\star$ by using a budget of $B(

Figures (3)

Figure 1: Instances for Examples 1 and 2. In both panels, Agent's (Principal's) reward is shown in red (blue) next to each edge. The first panel shows an acyclic graph with deterministic transitions, and the second panel shows a stochastic tree of depth 2.
Figure 2: Simulation results. The first panel presents the relationship between Principal's rewards and the approximation factor. The second panel shows the relationship between Principal's rewards and the budget.
Figure 3: An illustrative instance to demonstrate the non-continuous relationship between the principal's utility and budget allocation.

Theorems & Definitions (45)

Theorem 1: Informal statement of the formal guarantee for stochastic trees
Theorem 2: Informal statement of the formal approximation guarantee for deterministic decision processes
Theorem 3
Example 1
Example 2
Definition 1: $B$-implementable policy
Definition 2: Minimal implementation
Theorem 4
Theorem 5
Corollary 1
...and 35 more