Value function interference and greedy action selection in value-based multi-objective reinforcement learning

Peter Vamplew; Cameron Foale; Richard Dazeley

TL;DR

This paper identifies value-function interference as a fundamental challenge in value-based multi-objective reinforcement learning (MORL) with non-linear utilities, showing that vector-valued Q-functions can misguide policy learning when very different value vectors map to the same scalar utility. It demonstrates that in deterministic settings the Expected Scalarised Return (ESR) and Scalarised Expected Return (SER) criteria align, but random tie-breaking among actions whose vector values have equal utility can lead to sub-optimal convergence, and it tests deterministic tie-breaking as a mitigation. Empirical results indicate that deterministic tie-breaking reduces, but does not eliminate, interference, with the gains largely due to removing stochastic selection rather than to any bias in the tie-breaking order. The authors argue that distributional MORL or policy-search MORL are more robust approaches, especially for ESR optimisation in stochastic environments, where learning the distribution over vector returns enables correct ESR evaluation.
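To make the distinction between the two criteria concrete, the following minimal sketch computes both for a small set of outcomes. The utility function (the minimum over objectives) and the example return vectors are illustrative assumptions, not taken from the paper; they only show that the criteria coincide when the return is deterministic and can differ when it is stochastic.

```python
def utility(v):
    # Hypothetical non-linear utility (an assumption for illustration):
    # the agent is only as well off as its worst objective.
    return min(v)


def ser(outcomes):
    # Scalarised Expected Return: apply the utility to the expected vector.
    # outcomes is a list of (probability, return_vector) pairs.
    n = len(outcomes[0][1])
    mean = tuple(sum(p * r[i] for p, r in outcomes) for i in range(n))
    return utility(mean)


def esr(outcomes):
    # Expected Scalarised Return: take the expectation of the utility.
    return sum(p * utility(r) for p, r in outcomes)


# Made-up example outcomes, not from the paper.
stochastic = [(0.5, (10.0, 0.0)), (0.5, (0.0, 10.0))]
deterministic = [(1.0, (3.0, 4.0))]

print(ser(stochastic), esr(stochastic))
print(ser(deterministic), esr(deterministic))
```

With these made-up numbers the stochastic case looks balanced on average under SER, while under ESR every individual outcome has one objective at zero; the deterministic case gives the same value under both.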

Abstract

Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's utility with respect to the different objectives. However, as we demonstrate here, if the user's utility function maps widely varying vector-values to similar levels of utility, this can lead to interference in the value-function learned by the agent, leading to convergence to sub-optimal policies. This will be most prevalent in stochastic environments when optimising for the Expected Scalarised Return criterion, but we present a simple example showing that interference can also arise in deterministic environments. We demonstrate empirically that avoiding the use of random tie-breaking when identifying greedy actions can ameliorate, but not fully overcome, the problems caused by value function interference.
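The action-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the thresholded `utility` function, the `greedy_action` helper, and the example Q-vectors are all hypothetical. It shows how two actions with very different value vectors can tie on utility, and how the tie-breaking rule decides which one is treated as greedy.

```python
import random


def utility(q_vec, threshold=5.0):
    # Hypothetical utility (an assumption): the first objective only
    # counts up to a threshold, so (5, -1) and (8, -7) look equally good.
    return min(q_vec[0], threshold)


def greedy_action(q_vectors, tie_break="random", rng=random):
    # Scalarise each action's vector Q-value, then pick among the maximisers.
    scores = [utility(q) for q in q_vectors]
    best = max(scores)
    tied = [a for a, s in enumerate(scores) if s == best]
    if tie_break == "lowest":
        return tied[0]   # deterministic: lower-indexed action wins
    if tie_break == "highest":
        return tied[-1]  # deterministic: higher-indexed action wins
    return rng.choice(tied)  # random tie-breaking


# Made-up vector Q-values for the three actions in a single state.
q_vectors = [(5.0, -1.0), (8.0, -7.0), (2.0, 0.0)]

print(greedy_action(q_vectors, "lowest"))
print(greedy_action(q_vectors, "highest"))
print(greedy_action(q_vectors, "random"))
```

Under random tie-breaking the greedy action in this state can flip between actions 0 and 1 from one update to the next, so the vector value propagated to predecessor states can mix two very different vectors. That is one way to picture the interference the abstract describes.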

Paper Structure (8 sections, 3 equations, 4 figures, 1 table, 1 algorithm)

Introduction and Background
Interference in multi-objective value functions for deterministic environments
An example of value function interference in a stochastic MOMDP
Empirical evaluation of a potential solution
Random tie-breaking
Deterministic tie-breaking
Interference in multi-objective value functions for stochastic environments
Conclusion

Figures (4)

Figure 1: An example multi-objective MDP. The unlabelled states are terminal states, and the vector reward for reaching these states is indicated to their right. All other state transitions receive zero rewards.
Figure 2: Heatmaps showing the frequency with which each policy was selected as the final greedy policy across 100 trials, using different hyperparameter settings, for each of the three tie-breaking approaches: random (top row), lower-indexed action (middle row), and higher-indexed action (bottom row).
Figure 3: Heatmaps showing the difference in the frequency with which the incorrect Policy 2 was selected as the final greedy policy across 100 trials, using different hyperparameter settings, for each pairing of the three tie-breaking approaches: higher-indexed minus random (top row), lower-indexed minus random (middle row), and lower-indexed minus higher-indexed (bottom row).
Figure 4: An example multi-objective MDP with stochastic rewards. The unlabelled states are terminal states, and the vector reward for reaching these states is indicated to their right.
