An Empirical Investigation of Value-Based Multi-objective Reinforcement Learning for Stochastic Environments

Kewen Ding, Peter Vamplew, Cameron Foale, Richard Dazeley

TL;DR

This paper investigates the challenges of learning the SER-optimal policy in stochastic MOMDPs using value-based MORL. By evaluating baseline MOQ-learning, reward engineering, MOSS with global statistics, and policy-options on the Space Traders benchmark, it highlights how noisy Q-value estimates and local decision-making impede convergence to SER-optimal policies. The study shows partial gains from each approach, with decaying learning rates mitigating noise and policy options offering the strongest improvement in a small setting, yet none providing a universal solution or scalable applicability. The findings suggest that future progress will likely require policy-gradient or distributional reinforcement learning paradigms that can jointly address local decision dynamics and uncertain returns in larger, real-world MORL problems.

Abstract

One common approach to solving multi-objective reinforcement learning (MORL) problems is to extend conventional Q-learning by using vector Q-values in combination with a utility function. However, issues can arise with this approach in stochastic environments, particularly when optimising for the Scalarised Expected Reward (SER) criterion. This paper extends prior research, providing a detailed examination of the factors influencing how frequently value-based MORL Q-learning algorithms learn the SER-optimal policy for an environment with stochastic state transitions. We empirically examine several variations of the core multi-objective Q-learning algorithm, as well as reward engineering approaches, and demonstrate the limitations of these methods. In particular, we highlight the critical impact of noisy Q-value estimates on the stability and convergence of these algorithms.
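To make the "vector Q-values in combination with a utility function" idea concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes a tabular setting with two objectives and uses a simplified thresholded lexicographic ordering (TLO) as the utility. The function names (`tlo_key`, `greedy_action`, `mo_q_update`), the table layout, and the threshold values are hypothetical choices made for illustration only.

```python
import numpy as np

def tlo_key(q_vec, threshold):
    # Simplified TLO (illustrative assumption): clip the first objective at the
    # threshold, then use the second objective as a tie-breaker.
    return (min(float(q_vec[0]), threshold), float(q_vec[1]))

def greedy_action(q_table, state, threshold):
    # q_table has shape (n_states, n_actions, n_objectives); the utility is
    # applied to each action's vector Q-value to pick the greedy action.
    keys = [tlo_key(q, threshold) for q in q_table[state]]
    return max(range(len(keys)), key=lambda a: keys[a])

def mo_q_update(q_table, s, a, reward_vec, s_next, done, alpha, gamma, threshold):
    # Vector-valued Q-learning update: each objective is updated separately,
    # while the bootstrap action is chosen via the utility function.
    reward_vec = np.asarray(reward_vec, dtype=float)
    if done:
        target = reward_vec
    else:
        a_next = greedy_action(q_table, s_next, threshold)
        target = reward_vec + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (target - q_table[s, a])

# Hypothetical two-state, two-action, two-objective table.
q = np.zeros((2, 2, 2))
q[0] = [[0.9, -5.0], [1.0, -10.0]]
low = greedy_action(q, 0, threshold=0.88)   # both actions meet the threshold
high = greedy_action(q, 0, threshold=0.95)  # only the second action meets it
```

Because the greedy choice depends on comparing estimated Q-vectors against a threshold, small estimation noise near that threshold can flip the selected action, which is the kind of sensitivity the abstract attributes to noisy Q-value estimates.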

Paper Structure (26 sections, 2 equations, 13 figures, 14 tables, 4 algorithms)

Figures (13)

  • Figure 1: The Space Traders MOMDP. Solid black lines show the Direct actions, solid grey lines show the Indirect actions, and dashed lines indicate Teleport actions. Solid black circles indicate terminal (failure) states (VamplewEnvironmental2022).
  • Figure 2: Policy charts showing the greedy policy produced by the baseline multi-objective Q-learning algorithm (Algorithm \ref{algo:moql-expected}) on the Space Traders environment. Each chart shows the greedy policy identified by the agent at each episode of four different trials, culminating in different final policies. The dashed green line represents the threshold used for TLO, to highlight which policies meet this threshold.
  • Figure 3: Policy chart for the baseline method with a decayed learning rate on the Space Traders environment.
  • Figure 4: The Space Traders MR environment, which has the same state transition dynamics as the original Space Traders but with a modified reward design. The changed rewards have been highlighted in red.
  • Figure 5: Policy charts for MOQ-learning on the Space Traders MR environment -- each chart illustrates a sample run culminating in a different final policy. Charts (a)-(f) are from runs using a constant learning rate, while chart (g) is from a run using a decayed learning rate.
  • ...and 8 more figures