Value function interference and greedy action selection in value-based multi-objective reinforcement learning
Peter Vamplew, Cameron Foale, Richard Dazeley
TL;DR
This paper identifies value-function interference as a fundamental challenge for value-based MORL with non-linear utility functions: when the utility function assigns similar utility to widely differing vector values, the vector-valued Q-function can misguide policy learning. Although the Expected Scalarised Return (ESR) and Scalarised Expected Return (SER) criteria coincide in deterministic environments, the paper shows that random tie-breaking among greedy actions whose vector values have equal utility can still cause convergence to sub-optimal policies, and it tests deterministic tie-breaking as a mitigation. Empirical results indicate that deterministic tie-breaking reduces, but does not eliminate, interference, with the gains due largely to removing stochastic selection rather than to any bias in the tie-breaking order. The authors argue that distributional MORL or policy-search MORL are more robust approaches, especially for ESR optimisation in stochastic environments, where learning the distribution over vector returns enables correct ESR evaluation.
Abstract
Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's utility with respect to the different objectives. However, as we demonstrate here, if the user's utility function maps widely varying vector values to similar levels of utility, this can cause interference in the value function learned by the agent, resulting in convergence to sub-optimal policies. This will be most prevalent in stochastic environments when optimising for the Expected Scalarised Return criterion, but we present a simple example showing that interference can also arise in deterministic environments. We demonstrate empirically that avoiding the use of random tie-breaking when identifying greedy actions can ameliorate, but not fully overcome, the problems caused by value function interference.
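
As a rough illustration of the mechanism described in the abstract, the following Python sketch shows greedy action selection over vector-valued Q-estimates with a non-linear utility, and how random tie-breaking can blend dissimilar vectors into a bootstrapped estimate. It is a toy under stated assumptions, not the paper's algorithm or environment: the utility function (the minimum of two objectives), the two successor vectors, the learning rate, and all function names are hypothetical choices made for illustration.

```python
import random


def utility(v):
    # Assumed non-linear utility for illustration: the worse of the objectives.
    # It maps very different vectors, e.g. (1, 10) and (10, 1), to the same value.
    return min(v)


def greedy_actions(q_vectors):
    """Return the indices of all actions whose scalarised value is maximal."""
    scores = [utility(q) for q in q_vectors]
    best = max(scores)
    return [a for a, s in enumerate(scores) if s == best]


def select_greedy(q_vectors, random_ties, rng):
    """Pick a greedy action, breaking ties randomly or by lowest index."""
    ties = greedy_actions(q_vectors)
    return rng.choice(ties) if random_ties else ties[0]


def bootstrapped_estimate(q_next, random_ties, rng, alpha=0.01, steps=5000):
    """Move a predecessor's vector estimate toward the greedy successor vector.

    Rewards and discounting are omitted so that only the effect of
    tie-breaking on the learned vector is visible.
    """
    estimate = [0.0, 0.0]
    for _ in range(steps):
        a = select_greedy(q_next, random_ties, rng)
        estimate = [e + alpha * (t - e) for e, t in zip(estimate, q_next[a])]
    return estimate


# Two successor actions with very different vectors but identical utility.
q_next = [(1.0, 10.0), (10.0, 1.0)]
rng = random.Random(0)

deterministic = bootstrapped_estimate(q_next, random_ties=False, rng=rng)
randomised = bootstrapped_estimate(q_next, random_ties=True, rng=rng)

print("deterministic ties:", deterministic, "utility", utility(deterministic))
print("random ties:       ", randomised, "utility", utility(randomised))
```

With deterministic tie-breaking the estimate settles on one successor's vector, (1, 10), whose utility matches what the agent can actually obtain. With random tie-breaking the estimate drifts toward a mixture of the two vectors, and the utility of that mixture is higher than either action delivers, which is one way a predecessor state can come to look better than it really is.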
