Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting

Joar Skalse; Alessandro Abate

TL;DR

This work studies partial identifiability in inverse reinforcement learning (IRL) when agents discount future rewards non-exponentially, most notably hyperbolically. It generalises Boltzmann-rationality to arbitrary discounting by introducing three policy classes (resolute, naïve, and sophisticated) and corresponding Boltzmann-style models based on $Q^{\mathrm{R}}$, $Q^{\mathrm{N}}$, and $Q^{\mathrm{S}}$. The authors prove exact identifiability results: under the Boltzmann-naïve and Boltzmann-sophisticated models, the reward $R$ is identifiable up to naïve and sophisticated potential shaping, respectively. Identifiability is not robust across discounting models, however: two rewards that yield the same output under one model may yield different outputs under another. These findings imply that IRL alone may be insufficient to recover the true preferences of humans or other agents that discount non-exponentially, which affects preference elicitation and policy interpretation in realistic settings. The work motivates further study of richer behavioural models, misspecification effects, and extensions beyond episodic MDPs to better capture human-like decision processes.

Abstract

The aim of inverse reinforcement learning (IRL) is to infer an agent's preferences from observing their behaviour. Usually, preferences are modelled as a reward function, $R$, and behaviour is modelled as a policy, $\pi$. One of the central difficulties in IRL is that multiple preferences may lead to the same observed behaviour. That is, $R$ is typically underdetermined by $\pi$, which means that $R$ is only partially identifiable. Recent work has characterised the extent of this partial identifiability for different types of agents, including optimal and Boltzmann-rational agents. However, work so far has only considered agents that discount future reward exponentially. This is a serious limitation, especially given that extensive work in the behavioural sciences suggests that humans are better modelled as discounting hyperbolically. In this work, we characterise partial identifiability in IRL for agents with non-exponential discounting. Our results are particularly relevant for hyperbolic discounting, but they also apply more generally to agents that use other types of non-exponential discounting. Significantly, we show that IRL is generally unable to infer enough information about $R$ to identify the correct optimal policy, which entails that IRL alone can be insufficient to adequately characterise the preferences of such agents.
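To illustrate why non-exponential discounting calls for separate treatment, the following minimal sketch contrasts exponential discounting with the standard hyperbolic form $1/(1 + kt)$. It is not taken from the paper: the function names, the reward magnitudes (10 and 15), and the parameters `gamma` and `k` are illustrative assumptions. The sketch checks whether an agent prefers a larger reward one step later over a smaller reward sooner, as a function of how far away both rewards are.

```python
# Illustrative sketch only: names and numbers are assumptions, not the paper's.

def exponential(t, gamma=0.9):
    """Exponential discount weight gamma**t for a reward t steps away."""
    return gamma ** t


def hyperbolic(t, k=1.0):
    """Hyperbolic discount weight 1 / (1 + k*t) for a reward t steps away."""
    return 1.0 / (1.0 + k * t)


def prefers_later(discount, delay, small=10.0, large=15.0):
    """Return True if `large` at delay+1 is valued above `small` at delay."""
    return large * discount(delay + 1) > small * discount(delay)


if __name__ == "__main__":
    for name, d in [("exponential", exponential), ("hyperbolic", hyperbolic)]:
        print(name, "near:", prefers_later(d, 0), "far:", prefers_later(d, 5))
```

Under these assumed values, the exponential agent ranks the two rewards the same way at every delay, because the ratio of consecutive discount weights is constant. The hyperbolic agent prefers the larger reward when both are far away but switches to the smaller one when it becomes immediate. This preference reversal is the reason the resolute, naïve, and sophisticated policy classes need to be distinguished.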

