Value function interference and greedy action selection in value-based multi-objective reinforcement learning

Peter Vamplew, Cameron Foale, Richard Dazeley

TL;DR

This paper identifies value-function interference as a fundamental challenge for value-based MORL with non-linear utility functions: when widely differing vector returns map to similar utility, the vector-valued Q-function learned by the agent can misguide policy learning. In deterministic settings the Expected Scalarised Return (ESR) and Scalarised Expected Return (SER) criteria coincide, yet random tie-breaking among actions whose vector values have equal utility can still lead to convergence to sub-optimal policies, so the paper tests deterministic tie-breaking as a mitigation. Empirical results indicate that deterministic tie-breaking reduces, but does not eliminate, interference, with the gains coming largely from removing stochastic selection rather than from any bias in the tie-breaking order. The authors argue that distributional MORL or policy-search MORL are more robust approaches, especially for ESR optimisation in stochastic environments, where learning the distribution over vector returns enables correct ESR evaluation.
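The distinction between the ESR and SER criteria can be made concrete with a small numerical sketch. The utility function (product of the two objectives) and the two-outcome return distribution below are hypothetical choices made for illustration only; they are not taken from the paper.

```python
# Illustrative sketch: ESR vs SER under a non-linear utility.
# The utility and the outcome distribution are hypothetical assumptions.

def utility(v):
    """Hypothetical non-linear utility: product of the two objectives."""
    return v[0] * v[1]

# A policy whose vector return is (4, 0) or (0, 4), each with probability 0.5.
outcomes = [((4.0, 0.0), 0.5), ((0.0, 4.0), 0.5)]

def esr(outcomes, u):
    """Expected Scalarised Return: apply the utility to each return, then average."""
    return sum(p * u(v) for v, p in outcomes)

def ser(outcomes, u):
    """Scalarised Expected Return: average the vector returns, then apply the utility."""
    n = len(outcomes[0][0])
    mean = tuple(sum(p * v[i] for v, p in outcomes) for i in range(n))
    return u(mean)

esr_value = esr(outcomes, utility)  # 0.5 * 0 + 0.5 * 0 = 0.0
ser_value = ser(outcomes, utility)  # utility((2, 2)) = 4.0
print(esr_value, ser_value)
```

Under these assumptions the expected vector (2, 2) looks attractive, although no single episode ever delivers it. A value function that stores only expected vectors cannot tell the two cases apart, which is consistent with the summary's point that learning the distribution over vector returns is what enables correct ESR evaluation.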

Abstract

Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's utility with respect to the different objectives. However, as we demonstrate here, if the user's utility function maps widely varying vector-values to similar levels of utility, this can lead to interference in the value-function learned by the agent, leading to convergence to sub-optimal policies. This will be most prevalent in stochastic environments when optimising for the Expected Scalarised Return criterion, but we present a simple example showing that interference can also arise in deterministic environments. We demonstrate empirically that avoiding the use of random tie-breaking when identifying greedy actions can ameliorate, but not fully overcome, the problems caused by value function interference.
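The two modifications described in the abstract, together with the role of tie-breaking, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `min`-based utility, the function names `greedy_action` and `q_update`, and all numerical values are hypothetical.

```python
# Minimal sketch of greedy selection over vector-valued Q-values.
# All names, the utility, and the numbers are hypothetical assumptions.
import random

def utility(q):
    """Hypothetical non-linear utility: the worst-performing objective."""
    return min(q)

def greedy_action(q_vectors, u, tie_break="random", rng=None, tol=1e-9):
    """Return the index of the action with maximal utility.

    tie_break: "random", "lowest" (lower-indexed), or "highest" (higher-indexed).
    """
    utils = [u(q) for q in q_vectors]
    best = max(utils)
    tied = [a for a, val in enumerate(utils) if best - val <= tol]
    if tie_break == "lowest":
        return tied[0]
    if tie_break == "highest":
        return tied[-1]
    if tie_break == "random":
        return (rng or random).choice(tied)
    raise ValueError(f"unknown tie_break: {tie_break}")

def q_update(q_sa, reward, q_next, alpha, gamma):
    """Component-wise Q-learning update towards reward + gamma * q_next."""
    return tuple(q + alpha * (r + gamma * qn - q)
                 for q, r, qn in zip(q_sa, reward, q_next))

# Actions 0 and 1 have very different vectors but identical utility (0).
next_q = [(10.0, 0.0), (0.0, 10.0), (-1.0, 5.0)]
low = greedy_action(next_q, utility, tie_break="lowest")    # action 0
high = greedy_action(next_q, utility, tie_break="highest")  # action 1

# If the greedy successor action flips between the tied actions, the
# predecessor's Q-vector becomes a blend that neither action delivers.
q_prev = (0.0, 0.0)
q_prev = q_update(q_prev, (0.0, 0.0), next_q[low], alpha=0.5, gamma=1.0)
q_prev = q_update(q_prev, (0.0, 0.0), next_q[high], alpha=0.5, gamma=1.0)
print(q_prev, utility(q_prev))
```

In this toy case the blended estimate has higher utility than either tied successor actually provides, which is one way the interference described above can arise. A fixed tie-breaking rule keeps the bootstrapping target consistent between updates, but it does not help when the tied set itself changes as the estimates move.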

Paper Structure

This paper contains 8 sections, 3 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: An example multi-objective MDP. The unlabelled states are terminal states, and the vector reward for reaching these states is indicated to their right. All other state transitions receive zero rewards.
  • Figure 2: Heatmaps showing the frequency with which each policy was selected as the final greedy policy across 100 trials, using different hyperparameter settings, for each of the three tie-breaking approaches: random (top row), lower-indexed action (middle row), and higher-indexed action (bottom row).
  • Figure 3: Heatmaps showing the difference in the frequency with which the incorrect Policy 2 was selected as the final greedy policy across 100 trials, using different hyperparameter settings, for each pairing of the three tie-breaking approaches: higher-indexed minus random (top row), lower-indexed minus random (middle row), and lower-indexed minus higher-indexed (bottom row).
  • Figure 4: An example multi-objective MDP with stochastic rewards. The unlabelled states are terminal states, and the vector reward for reaching these states is indicated to their right.