In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning

Mikhail Terekhov, Caglar Gulcehre

TL;DR

The paper tackles MORL by proposing Dynamic MORL (DMORL), which conditions a single policy on objective weights α ∈ Δ_K to cover the Pareto front. It introduces two on-policy methods, MOPPO (multi-objective PPO) and MOA2C, and studies three actor-critic architectures (Multi-body, Merge net, Hypernetwork), using PopArt normalization and entropy control to stabilize learning. In experiments on the MO-Gym benchmarks Deep Sea Treasure, Minecart, and Reacher, MOPPO generally achieves higher hypervolume and better Pareto-front coverage than baselines such as Pareto Conditioned Networks and Envelope Q-learning, especially in stochastic settings. The work offers practical insights for building scalable, architecture-aware MORL systems, and notes two limitations: reliance on linear scalarization and the absence of theoretical convergence guarantees.

Abstract

Multi-objective reinforcement learning (MORL) is essential for addressing the intricacies of real-world RL problems, which often require trade-offs between multiple utility functions. However, MORL is challenging due to unstable learning dynamics with deep learning-based function approximators. The research path most taken has been to explore different value-based loss functions for MORL to overcome this issue. Our work empirically explores model-free policy learning loss functions and the impact of different architectural choices. We introduce two different approaches: Multi-objective Proximal Policy Optimization (MOPPO), which extends PPO to MORL, and Multi-objective Advantage Actor Critic (MOA2C), which acts as a simple baseline in our ablations. Our proposed approach is straightforward to implement, requiring only small modifications at the level of the function approximator. We conduct comprehensive evaluations on the MORL Deep Sea Treasure, Minecart, and Reacher environments and show that MOPPO effectively captures the Pareto front. Our extensive ablation studies and empirical analyses reveal the impact of different architectural choices, underscoring the robustness and versatility of MOPPO compared to popular MORL approaches like Pareto Conditioned Networks (PCN) and Envelope Q-learning in terms of MORL metrics, including hypervolume and expected utility.
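
As a rough illustration of the weight-conditioned setup mentioned in the TL;DR, the sketch below draws a weight vector α from the simplex Δ_K and collapses a vector reward into a scalar by linear scalarization. It is a minimal sketch under stated assumptions, not the authors' code: the function names `sample_simplex` and `scalarize`, the uniform sampling of α, and the example reward values are all hypothetical.

```python
import random


def sample_simplex(k, rng):
    """Draw a weight vector uniformly from the simplex Delta_K.

    Normalizing K independent Exp(1) draws gives a Dirichlet(1, ..., 1)
    sample. Uniform sampling is an assumption made for this sketch only.
    """
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, alpha):
    """Linear scalarization: weighted sum of the per-objective rewards."""
    if len(reward_vector) != len(alpha):
        raise ValueError("reward_vector and alpha must have the same length")
    return sum(a * r for a, r in zip(alpha, reward_vector))


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = sample_simplex(2, rng)   # e.g. weights for (treasure, fuel)
    reward = [8.2, -3.0]             # hypothetical two-objective reward
    print("alpha:", alpha)
    print("scalarized reward:", scalarize(reward, alpha))
```

A weight vector with a zero entry, such as `[1.0, 0.0]`, lies on the boundary of the simplex and discards one objective entirely.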

Paper Structure

This paper contains 30 sections, 23 equations, 11 figures, 3 tables, 4 algorithms.

Figures (11)

  • Figure 1: Pareto fronts on Deep Sea Treasure: Performance of a selection of our methods on the Deep Sea Treasure environment, split by the actor-critic architecture and the learning algorithm. The details of the architectures and algorithms are described in Sections \ref{sec:arch} and \ref{sec:algo}, respectively. In this simple gridworld, the agent's task is to find the biggest treasure, but big treasures require it to spend more fuel. The tension between the two objectives is formalized as a Pareto front of the problem. Our proposed approaches effectively cover the true Pareto front. Some of the methods produce a few outliers because the policy struggles to learn near the boundary of the simplex $\Delta_K$ of reward weights, where one of the rewards (in this case, fuel) is completely discarded.
  • Figure 2: Actor-critic architectures with shared trunks: Non-shared versions are organized similarly. The dashed line in the hypernetwork chart is optional: $s_t$ can also be passed into the hypernetwork, which gives the "hypernet w/obs" variant of the architecture.
  • Figure 3: Entropy control schedules: Example entropy behaviors on Minecart when using the entropy control method described in Section \ref{sec:entropy}. From left to right: custom schedule, cosine schedule, and linear schedule of entropy. The custom schedule is designed to have a flat start for exploration and an extended flat end for fine-tuning the behavior. The schedules are discussed in detail in Appendix \ref{app:entropy}.
  • Figure 4: Architecture ablations on Minecart: Left: comparison of the algorithms and policy architectures on Minecart. The color represents the algorithm (MOA2C or MOPPO), filled columns correspond to shared trunk architectures, and hatching denotes the specific architecture described in Section \ref{sec:arch}. Right: comparison of entropy control from Section \ref{sec:entropy}, using the three entropy schedules described in Appendix \ref{app:entropy}, against standard entropy regularization with a fixed weight $\lambda$.
  • Figure 5: Hypervolume of all hyperparameter configurations for the ablations demonstrated in Figure \ref{fig:minecart-tuning} and Table \ref{tab:big_table}, run on the non-deterministic Minecart environment. Part 1. "Hypernet+o." is the architecture where the observation is provided to the hypernetwork along with $\boldsymbol{\alpha}$.
  • ...and 6 more figures
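
The captions above refer to Pareto fronts and hypervolume. The following minimal sketch shows one common way to compute both for a two-objective maximization problem: filter the non-dominated points, then sweep them to sum the area they dominate relative to a reference point. The helper names, the example return vectors, and the reference point are hypothetical and are not taken from the paper or from any benchmark.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (maximization)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, ref):
    """Area dominated by the front and bounded below by ref (2 objectives)."""
    # Sort by the first objective, descending; on a Pareto front the second
    # objective then increases from one point to the next.
    front = sorted(set(pareto_front(points)), key=lambda p: p[0], reverse=True)
    volume = 0.0
    prev_y = ref[1]
    for x, y in front:
        if x <= ref[0] or y <= prev_y:
            continue  # contributes no area beyond what is already counted
        volume += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return volume


if __name__ == "__main__":
    # Hypothetical (treasure, negative fuel) returns of four policies.
    returns = [(1, -1), (2, -3), (3, -5), (1, -4)]
    print(pareto_front(returns))             # (1, -4) is dominated by (1, -1)
    print(hypervolume_2d(returns, (0, -6)))  # 9.0
```

A larger hypervolume means the set of policies covers more of the objective space above the reference point, which is why it is used to compare Pareto-front coverage across methods.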