Utility-Based Reinforcement Learning: Unifying Single-objective and Multi-objective Reinforcement Learning

Peter Vamplew; Cameron Foale; Conor F. Hayes; Patrick Mannion; Enda Howley; Richard Dazeley; Scott Johnson; Johan Källström; Gabriel Ramos; Roxana Rădulescu; Willem Röpke; Diederik M. Roijers

TL;DR

This work introduces utility-based reinforcement learning (UBRL) as a unifying framework that extends single-objective RL (SORL) to multi-objective settings by scalarising vector rewards with a utility function $u$. Environments are modelled as multi-objective Markov decision processes (MOMDPs) with vector rewards $\mathbf{R}$, and policies are optimised under scalar criteria such as scalarised expected returns (SER) and expected scalarised returns (ESR). UBRL subsumes standard RL (when the number of objectives $n=1$ and $u$ is the identity) and enables multi-policy learning across diverse utility definitions. The authors discuss two practical formulations and advocate applying UBRL to SORL for benefits such as simplified reward design, risk-sensitive behaviour, and safe, satisficing actions, illustrated through multi-policy risk preferences (CVaR), multi-policy discounting, and non-monotonic utilities. They highlight algorithmic implications, including the need to handle non-linear utility, possible augmented state representations, and inner-loop learning to optimise multiple policies simultaneously. They argue that UBRL can accelerate knowledge transfer between MORL and SORL while expanding decision-maker control. Its significance lies in offering a flexible, general framework that captures a wide range of objectives and preferences, enabling post-hoc policy selection and more robust, human-aligned RL systems with potentially better sample efficiency and adaptability.
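The difference between the two scalar criteria named above can be illustrated with a minimal sketch. It assumes that SER (scalarised expected returns) applies $u$ to the average return vector, while ESR (expected scalarised returns) averages $u$ over per-episode return vectors. The function names, sample returns, and utility functions are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def ser(returns, u):
    """Scalarised expected return: average the return vectors, then apply u."""
    return float(u(np.mean(returns, axis=0)))

def esr(returns, u):
    """Expected scalarised return: apply u to each episode's return, then average."""
    return float(np.mean([u(r) for r in returns]))

# Hypothetical utility functions (assumptions for illustration only).
def product_utility(r):
    # Non-linear: rewards outcomes that are balanced across both objectives.
    return float(r[0] * r[1])

def identity_utility(r):
    # Single-objective case: n = 1 and u is the identity.
    return float(r[0])

# Hypothetical per-episode return vectors for one policy, two objectives.
vector_returns = np.array([[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]])

# Hypothetical per-episode returns for a single-objective task.
scalar_returns = np.array([[1.0], [3.0]])

if __name__ == "__main__":
    print("SER:", ser(vector_returns, product_utility))
    print("ESR:", esr(vector_returns, product_utility))
    print("n=1, identity:", ser(scalar_returns, identity_utility),
          esr(scalar_returns, identity_utility))
```

With a non-linear utility the two criteria can disagree: here the average return looks balanced, but no single episode is. With one objective and the identity utility, both criteria reduce to the ordinary expected return, which is the sense in which standard RL is a special case.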

Abstract

Research in multi-objective reinforcement learning (MORL) has introduced the utility-based paradigm, which makes use of both environmental rewards and a function that defines the utility derived by the user from those rewards. In this paper we extend this paradigm to the context of single-objective reinforcement learning (RL), and outline multiple potential benefits including the ability to perform multi-policy learning across tasks relating to uncertain objectives, risk-aware RL, discounting, and safe RL. We also examine the algorithmic implications of adopting a utility-based approach.

Paper Structure (12 sections, 9 equations)

Introduction
Formalising utility-based RL
MDPs, MOMDPs and Optimisation criteria
Utility-based RL as a general framework
Motivation for utility-based SORL
Potential single-objective applications of utility-based RL
Multi-policy methods for hard-to-define objectives
Multi-policy risk-aware RL
Multi-policy discounting
Satisficing agents
Implications of non-linear utility
Conclusion
