STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate

TL;DR

STARC metrics are a theory-backed framework for quantifying differences between reward functions in reinforcement learning: they collapse transformations that do not affect the ordering of policies and measure distance in the resulting canonicalised space. The paper shows that STARC metrics are sound and complete, meaning that STARC distance induces both an upper and a lower bound on worst-case regret, and that any metric with the same guarantees must be bilipschitz equivalent to a STARC metric. Empirically, STARC metrics outperform prior metrics such as EPIC and DARD in both large random MDPs and a continuous Reacher task, which makes the evaluation of reward learning algorithms more reliable. The approach offers a practical, closed-form-compatible tool for both theoretical analysis and empirical benchmarking.

Abstract

In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper, we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically to demonstrate their practical efficacy. STARC metrics can be used to make theoretical and empirical analysis of reward learning algorithms both easier and more principled.

Paper Structure

This paper contains 32 sections, 34 theorems, 77 equations, 13 figures, 7 tables.

Key Result

Proposition 1

$(S,A,\tau,\mu_0,R_1,\gamma)$ and $(S,A,\tau,\mu_0,R_2,\gamma)$ have the same ordering of policies if and only if $R_1$ and $R_2$ differ by potential shaping, positive linear scaling, and $S'$-redistribution.
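
The "if" direction of Proposition 1 can be checked numerically on a small example. The sketch below is illustrative only: the random MDP, the scale factor, and the helper names `policy_return` and `transform` are assumptions made for this example and are not taken from the paper. It also enumerates deterministic policies only, so it is a sanity check rather than a proof.

```python
import itertools
import numpy as np

def policy_return(R, tau, mu0, gamma, policy):
    """Expected discounted return J(pi) of a deterministic policy (one action per state)."""
    n_states = tau.shape[0]
    idx = np.arange(n_states)
    P = tau[idx, policy]                    # (S, S') transition matrix under pi
    r = (P * R[idx, policy]).sum(axis=1)    # expected one-step reward in each state
    V = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return mu0 @ V

def transform(R, tau, gamma, rng, scale=2.5):
    """Apply S'-redistribution, potential shaping, and positive linear scaling to R."""
    noise = rng.normal(size=R.shape)
    noise -= (tau * noise).sum(axis=2, keepdims=True)  # zero mean under tau(s, a)
    phi = rng.normal(size=R.shape[0])                  # arbitrary potential function
    shaping = gamma * phi[None, None, :] - phi[:, None, None]
    return scale * (R + noise + shaping)

# Hypothetical random MDP: tau[s, a, s'] and R[s, a, s'].
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
tau = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
mu0 = rng.dirichlet(np.ones(n_states))
R1 = rng.normal(size=(n_states, n_actions, n_states))
R2 = transform(R1, tau, gamma, rng)

# Enumerate every deterministic policy and compare the induced orderings.
policies = [np.array(p) for p in itertools.product(range(n_actions), repeat=n_states)]
J1 = np.array([policy_return(R1, tau, mu0, gamma, p) for p in policies])
J2 = np.array([policy_return(R2, tau, mu0, gamma, p) for p in policies])

same_order = np.array_equal(np.argsort(J1), np.argsort(J2))
print("same policy ordering:", same_order)
```

In this construction the two return vectors are related by a positive affine map (`J2` equals `2.5 * J1` plus a constant that depends only on the potential and `mu0`), which is why the ordering is preserved.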

Figures (13)

  • Figure 1: This figure displays the correlation with regret for several pseudometrics. Each point represents one pseudometric, i.e. one unique combination of canonicalisation $c$, normalisation $n$, and distance metric $m$. The points are grouped by canonicalisation function, with each column corresponding to a different canonicalisation function. Pseudometrics which skip canonicalisation or normalisation are shown in grey. The versions of EPIC and DARD that use the $L_2$ norm for both the normalisation $n$ and the distance metric $m$ are highlighted in red, as these are the original versions given in the EPIC and DARD papers. The STARC metrics, which are canonicalised using VAL, are reliably better indicators of regret than the other pseudometrics.
  • ...and 12 more figures
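
The caption of Figure 1 describes each pseudometric as a combination of a canonicalisation $c$, a normalisation $n$, and a distance metric $m$. A minimal sketch of that pipeline follows. Several parts are assumptions rather than details confirmed by this page: the VAL-style canonicalisation is taken to be $\mathbb{E}_{S' \sim \tau(s,a)}[R(s,a,S') - V^\pi(s) + \gamma V^\pi(S')]$ for a fixed policy $\pi$ (uniform here), the $L_2$ norm is used for both $n$ and $m$, and the function names and the random MDP are hypothetical.

```python
import numpy as np

def val_canonicalise(R, tau, gamma, policy):
    """Assumed VAL-style canonicalisation: E_{S'}[R(s,a,S') - V(s) + gamma * V(S')]."""
    n_states = R.shape[0]
    r_sa = (tau * R).sum(axis=2)                  # expected reward for each (s, a)
    P = np.einsum("sa,sat->st", policy, tau)      # state transitions under the policy
    r_s = (policy * r_sa).sum(axis=1)             # expected reward in each state
    V = np.linalg.solve(np.eye(n_states) - gamma * P, r_s)
    c_sa = r_sa - V[:, None] + gamma * (tau @ V)  # shape (S, A)
    # The canonicalised reward does not depend on s', so copy it along that axis.
    return np.broadcast_to(c_sa[:, :, None], R.shape).copy()

def standardise(R, tau, gamma, policy):
    """Canonicalise, then divide by the L2 norm unless the result is zero."""
    c = val_canonicalise(R, tau, gamma, policy)
    norm = np.linalg.norm(c)
    return c / norm if norm > 0 else c

def starc_style_distance(R1, R2, tau, gamma, policy):
    """L2 distance between the standardised rewards."""
    s1 = standardise(R1, tau, gamma, policy)
    s2 = standardise(R2, tau, gamma, policy)
    return np.linalg.norm(s1 - s2)

# Hypothetical random MDP with a uniform reference policy.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
tau = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
policy = np.full((n_states, n_actions), 1.0 / n_actions)
R1 = rng.normal(size=(n_states, n_actions, n_states))

# A reward that differs from R1 only by potential shaping and positive scaling.
phi = rng.normal(size=n_states)
R_equiv = 3.0 * (R1 + gamma * phi[None, None, :] - phi[:, None, None])

d_equiv = starc_style_distance(R1, R_equiv, tau, gamma, policy)
d_opposite = starc_style_distance(R1, -R1, tau, gamma, policy)
print("equivalent rewards:", d_equiv)
print("opposite rewards:", d_opposite)
```

Under these assumptions, rewards that differ only by ordering-preserving transformations land on the same standardised point, so their distance is zero up to floating-point error, while a reward and its negation land on antipodal unit vectors.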

Theorems & Definitions (64)

  • Proposition 1
  • Definition 1
  • Definition 2
  • Definition 3
  • Proposition 2
  • Definition 4
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • ...and 54 more