Utility-Based Reinforcement Learning: Unifying Single-objective and Multi-objective Reinforcement Learning

Peter Vamplew, Cameron Foale, Conor F. Hayes, Patrick Mannion, Enda Howley, Richard Dazeley, Scott Johnson, Johan Källström, Gabriel Ramos, Roxana Rădulescu, Willem Röpke, Diederik M. Roijers

TL;DR

This work introduces utility-based reinforcement learning (UBRL) as a unifying framework that carries the utility-based paradigm of multi-objective RL (MORL) over to single-objective RL (SORL). Environments are modeled as multi-objective MDPs (MOMDPs) with vector rewards $\mathbf{R}$, and a utility function $u$ maps those rewards to a scalar under a criterion such as scalarized expected returns (SER) or expected scalarized returns (ESR). Standard RL is recovered as the special case where $n=1$ (a single objective) and $u$ is the identity, while other choices of $u$ enable multi-policy learning across diverse utility definitions. The authors discuss two practical formulations and advocate applying UBRL to SORL for benefits such as simplified reward design, risk-sensitive behaviour, and safe, satisficing actions. They illustrate these with multi-policy learning over risk preferences (for example, conditional value at risk, CVaR), discounting, and non-monotonic utilities. The algorithmic implications they highlight include handling non-linear utilities, possibly augmenting the state representation, and inner-loop learning that optimizes multiple policies simultaneously. The authors argue that UBRL can accelerate knowledge transfer between MORL and SORL while expanding decision-maker control. Its significance lies in a flexible, general framework that captures a wide range of objectives and preferences, enabling post-hoc policy selection and more robust, human-aligned RL systems, potentially with better sample efficiency and adaptability.
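To make the difference between the two criteria concrete, here is a minimal sketch, not the authors' implementation. It uses the standard MORL definitions: SER applies $u$ to the expected return vector, while ESR takes the expectation of $u$ applied to each episode's return vector. The sample returns, the product-form utility, and the helper names `ser`, `esr`, and `cvar` are all assumptions made for illustration.

```python
import numpy as np


def ser(returns, u):
    """Scalarized expected return: apply u to the mean return vector."""
    return u(returns.mean(axis=0))


def esr(returns, u):
    """Expected scalarized return: average u over per-episode return vectors."""
    return float(np.mean([u(g) for g in returns]))


def cvar(values, alpha):
    """Mean of the worst alpha-fraction of scalar outcomes (lower tail)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(ordered))))
    return float(ordered[:k].mean())


# Hypothetical per-episode return vectors for one policy, n = 2 objectives.
returns = np.array([[0.0, 4.0], [4.0, 0.0]])

# Hypothetical non-linear utility: the product of the two objectives.
def product_utility(g):
    return float(g[0] * g[1])

print(ser(returns, product_utility))  # u(mean) = u([2, 2]) = 4.0
print(esr(returns, product_utility))  # mean of u([0, 4]) and u([4, 0]) = 0.0

# Single-objective special case: n = 1 and u is the identity,
# so both criteria reduce to the ordinary expected return.
scalar_returns = np.array([[1.0], [3.0]])

def identity_utility(g):
    return float(g[0])

print(ser(scalar_returns, identity_utility))  # 2.0
print(esr(scalar_returns, identity_utility))  # 2.0
```

Under a non-linear utility the two criteria rank the same policy differently (4.0 versus 0.0 here), which is why the choice between SER and ESR matters. With $n=1$ and the identity utility they coincide. The `cvar` helper shows one simple empirical way a risk preference could be computed from sampled scalar returns; for instance, `cvar([1, 2, 3, 4], 0.5)` averages the two worst outcomes.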

Abstract

Research in multi-objective reinforcement learning (MORL) has introduced the utility-based paradigm, which makes use of both environmental rewards and a function that defines the utility derived by the user from those rewards. In this paper we extend this paradigm to the context of single-objective reinforcement learning (RL), and outline multiple potential benefits including the ability to perform multi-policy learning across tasks relating to uncertain objectives, risk-aware RL, discounting, and safe RL. We also examine the algorithmic implications of adopting a utility-based approach.
