Occam's razor is insufficient to infer the preferences of irrational agents
Stuart Armstrong, Sören Mindermann
TL;DR
The paper shows that inferring human rewards from behavior via inverse reinforcement learning is fundamentally underdetermined: a No Free Lunch result shows that a policy cannot be uniquely decomposed into a planner and a reward, and simplicity priors such as Occam's razor do not resolve this. It demonstrates that degenerate, low-complexity decompositions exist that fit any observed policy, and argues that genuine human reward functions are intrinsically complex because of biases, contingency, and cultural variation. The authors argue that progress requires normative assumptions about planner and reward structure that cannot be derived from observations alone, and they outline avenues for identifying minimal, shared priors. This work highlights the practical limits of IRL for value alignment when the agent is irrational, and motivates explicit normative frameworks to guide reward inference and policy evaluation, including considerations of manipulation and override of human preferences.
Abstract
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments. This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple 'normative' assumptions, which cannot be deduced exclusively from observations.
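The non-uniqueness claim in (1) can be illustrated with a toy sketch. The Python below is not the paper's formalism: it assumes a one-step setting with two states and two actions, and every name in it (`rational_planner`, `anti_rational_planner`, `make_indifferent_planner`, `observed_policy`) is hypothetical. It shows three different planner and reward pairs that all reproduce the same observed policy, so the policy alone cannot tell them apart.

```python
from typing import Callable, Dict, Tuple

# Toy one-step setting (an assumption for illustration, not the paper's setup).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

Reward = Dict[Tuple[str, str], float]
Policy = Dict[str, str]
Planner = Callable[[Reward], Policy]


def rational_planner(reward: Reward) -> Policy:
    """Pick the reward-maximizing action in every state."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def anti_rational_planner(reward: Reward) -> Policy:
    """Pick the reward-minimizing action in every state."""
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def make_indifferent_planner(policy: Policy) -> Planner:
    """Build a planner that ignores the reward and always returns `policy`."""
    return lambda reward: dict(policy)


# The behavior we observe (hypothetical).
observed_policy: Policy = {"s0": "left", "s1": "right"}

# Candidate reward 1: the observed action is the rewarded one.
reward_pos: Reward = {
    (s, a): 1.0 if a == observed_policy[s] else 0.0
    for s in STATES
    for a in ACTIONS
}
# Candidate reward 2: the exact negation of reward 1.
reward_neg: Reward = {key: -value for key, value in reward_pos.items()}
# Candidate reward 3: the agent cares about nothing.
reward_zero: Reward = {key: 0.0 for key in reward_pos}

decompositions = {
    "rational planner, reward R": rational_planner(reward_pos),
    "anti-rational planner, reward -R": anti_rational_planner(reward_neg),
    "indifferent planner, zero reward": make_indifferent_planner(observed_policy)(reward_zero),
}

for name, policy in decompositions.items():
    # Every decomposition is compatible with the observed behavior.
    print(f"{name}: matches observed policy = {policy == observed_policy}")
```

All three pairs fit the observations equally well, yet they attribute opposite or empty preferences to the agent. Each is also simple to describe, which gives some intuition for why a simplicity prior alone does not single out the intended decomposition.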
