Inverse Concave-Utility Reinforcement Learning is Inverse Game Theory

Mustafa Mert Çelikok, Frans A. Oliehoek, Jan-Willem van de Meent

TL;DR

This work addresses the problem of inferring reward functions when an agent optimizes a concave utility, the setting of concave-utility reinforcement learning (CURL), where traditional IRL fails because the objective is nonlinear and lacks Bellman structure. Leveraging the known equivalence between CURL and a subclass of mean-field games (CURL-MFG), the authors formulate inverse CURL (I-CURL) as an inverse game theory problem in MFGs and define a game-theoretic feasible reward set that exactly characterizes the rewards yielding the observed mean-field Nash equilibrium. They develop a saddle-point (min–max) characterization over a parametric reward class to compute feasible rewards, and discuss empirical considerations via an approximate, data-driven approach (Empirical I-CURL) and the Adversarial Inverse Multi-Agent Planning (AIMP) algorithm. The framework enables recovering reward structures that encode bounded-rationality phenomena (e.g., information- or risk-bounded decisions) and supports human–AI collaboration by providing interpretable reward descriptions of CURL-style behaviors. The work thus fills a theoretical gap in inverse RL by extending the IRL paradigm to CURL via inverse game theory in mean-field settings, and it points to future empirical validation and extensions to broader bounded-rationality models.

Abstract

We consider inverse reinforcement learning problems with concave utilities. Concave Utility Reinforcement Learning (CURL) is a generalisation of the standard RL objective that employs a concave function of the state occupancy measure rather than a linear function. CURL has garnered recent attention for its ability to represent many important applications, including standard RL as well as imitation learning, pure exploration, constrained MDPs, offline RL, human-regularized RL, and others. Inverse reinforcement learning is a powerful paradigm that focuses on recovering an unknown reward function that can rationalize the observed behaviour of an agent. There have been recent theoretical advances in inverse RL where the problem is formulated as identifying the set of feasible reward functions. However, inverse RL for CURL problems has not been considered previously. In this paper we show that most of the standard IRL results do not apply to CURL in general, since CURL invalidates the classical Bellman equations. This calls for a new theoretical framework for the inverse CURL (I-CURL) problem. Using a recent equivalence result between CURL and mean-field games, we propose a new definition of the feasible rewards for I-CURL by proving that this problem is equivalent to an inverse game theory problem in a subclass of mean-field games. We outline future directions and applications in human–AI collaboration enabled by our results.
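To make the contrast between a linear and a concave objective concrete, the following minimal sketch computes a discounted state-action occupancy measure for a toy MDP and evaluates both kinds of objective on it. Everything in it is an assumption chosen for illustration and is not taken from the paper: the two-state MDP, the uniform policy, the normalization $d^\pi(s,a) = (1-\gamma)\sum_t \gamma^t \Pr(s_t = s, a_t = a)$, and the choice of state-marginal entropy as the concave utility (a pure-exploration style objective).

```python
import math

def occupancy(P, pi, mu0, gamma, horizon=2000):
    """Normalized discounted state-action occupancy d[s][a].

    Assumed convention: d(s,a) = (1 - gamma) * sum_t gamma^t * Pr(s_t=s, a_t=a),
    with the infinite sum truncated after `horizon` steps.
    """
    n_s, n_a = len(pi), len(pi[0])
    d = [[0.0] * n_a for _ in range(n_s)]
    state_dist = list(mu0)
    weight = 1.0 - gamma
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                mass = state_dist[s] * pi[s][a]
                d[s][a] += weight * mass
                for s2 in range(n_s):
                    nxt[s2] += mass * P[s][a][s2]
        state_dist = nxt
        weight *= gamma
    return d

def linear_objective(d, R):
    """Standard RL objective: the inner product <d, R>."""
    return sum(d[s][a] * R[s][a] for s in range(len(d)) for a in range(len(d[0])))

def entropy_objective(d):
    """A concave utility of d: entropy of the state marginal of d."""
    total = 0.0
    for row in d:
        m = sum(row)
        if m > 0.0:
            total -= m * math.log(m)
    return total

# Hypothetical toy MDP: action 0 stays in the current state, action 1 switches.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]    # uniform stationary policy
mu0 = [1.0, 0.0]                 # always start in state 0
R = [[1.0, 0.0], [0.0, 0.0]]     # illustrative reward, only used by the linear objective

d = occupancy(P, pi, mu0, gamma=0.9)
print(linear_objective(d, R), entropy_objective(d))
```

The linear objective depends on the policy only through $\langle d^\pi, R \rangle$, whereas the entropy objective is a nonlinear function of the same $d^\pi$; this is the sense in which CURL replaces the linear function with a concave one.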

Paper Structure (16 sections, 5 theorems, 5 equations, 1 figure, 1 table)

Key Result

Lemma 4

There exists an MDP $M$ with concave utility $F$ such that there can be no stationary reward $R \in \mathbb{R}^{S \times A}$ with $\mathop{\mathrm{argmax}}\limits_{d^\pi \in \mathcal{K}_\gamma} \langle d^\pi, R \rangle = \mathop{\mathrm{argmax}}\limits_{d^\pi \in \mathcal{K}_\gamma} F(d^\pi).$
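The intuition behind this kind of result can be checked numerically on a very small instance. The sketch below is an illustrative assumption and not necessarily the construction used in the paper's proof: it takes a single-state MDP with two actions, so that (under a normalized occupancy convention) the occupancy measure is just $(p, 1-p)$ with $p$ the probability of the first action, and uses entropy as the concave utility $F$. A linear objective over this segment is maximized at an endpoint or everywhere at once, so its argmax set can never equal the single interior maximizer of $F$.

```python
import math

# Occupancy measures of the assumed single-state, two-action MDP: d = (p, 1 - p).
GRID = [i / 100 for i in range(101)]

def entropy(p):
    """Concave utility F(d) for d = (p, 1 - p)."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def argmax_set(f, tol=1e-12):
    """All grid points whose value is within `tol` of the maximum."""
    values = [f(p) for p in GRID]
    best = max(values)
    return [p for p, v in zip(GRID, values) if best - v <= tol]

concave_opt = argmax_set(entropy)

# A few hypothetical stationary rewards R = (R(a_1), R(a_2)).
rewards = [(1.0, 0.0), (0.0, 1.0), (0.3, 0.3), (-2.0, 5.0)]
linear_opts = {
    R: argmax_set(lambda p, R=R: p * R[0] + (1.0 - p) * R[1])
    for R in rewards
}

print("argmax of F:", concave_opt)
for R, opt in linear_opts.items():
    print(R, "->", opt if len(opt) < 5 else f"{len(opt)} grid points")
```

A constant reward such as $(0.3, 0.3)$ does make the entropy maximizer optimal, but only by making every occupancy measure optimal, so the two argmax sets still differ.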

Figures (1)

  • Figure 1: Illustrative example of an information-limited MDP. A captain must navigate to one of the ports ($a$, $b$, $c$). The sharks denote dangerous waters with highly negative reward. The most preferred port is $a$ and the least preferred is $c$. The environment is not stochastic, but a cognitively constrained captain can make mistakes. The captain chooses to go to port $b$, knowing they could make mistakes when following the golden path to $a$. The green-shaded area shows the paths the captain follows to $b$. They make a special effort to avoid $c$, but otherwise follow a noisy path.

Theorems & Definitions (11)

  • Definition 1: Markov Decision Process
  • Definition 2: Discounted State-Action Occupancy Measure
  • Definition 3: Set of Discounted State-Action Occupancy Measures
  • Lemma 4
  • Definition 5: Mean-field Nash Equilibrium (MFNE) and Exploitability
  • Definition 6: Feasible Rewards Set for IRL [metelli2021provably, metelli2023towards, lindner2022active]
  • Lemma 7
  • Proposition 8: Feasible Reward Set for I-CURL
  • Proposition 10
  • Corollary 12
  • ...and 1 more