In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning
Mikhail Terekhov, Caglar Gulcehre
TL;DR
The paper tackles multi-objective reinforcement learning (MORL) by proposing Dynamic MORL (DMORL), which conditions a single policy on objective weights α ∈ Δ_K so that one network covers the Pareto front. It introduces two on-policy methods, MOPPO (multi-objective PPO) and MOA2C, and studies three actor-critic architectures (Multi-body, Merge net, Hypernetwork), using PopArt normalization and entropy control to stabilize learning. In experiments on the MO-Gym benchmarks Deep Sea Treasure, Minecart, and Reacher, MOPPO generally achieves higher hypervolume and better Pareto-front coverage than baselines such as Pareto Conditioned Networks and Envelope Q-learning, especially in stochastic settings. The work offers practical guidance for building scalable, architecture-aware MORL systems, and notes two limitations: reliance on linear scalarization and the absence of theoretical convergence guarantees.
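To make the weight-conditioned setup concrete, the following is a minimal sketch of sampling α from the simplex Δ_K and applying linear scalarization to a vector reward. All function names here are hypothetical, and appending α to the observation is only the simplest possible form of conditioning, assumed for illustration; it is not one of the three architectures the paper studies.

```python
import random


def sample_simplex_weights(num_objectives, rng):
    """Draw alpha uniformly from the probability simplex Delta_K."""
    # Normalized i.i.d. Exponential(1) draws are uniformly distributed on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(vector_reward, alpha):
    """Linear scalarization: weighted sum of the K per-objective rewards."""
    return sum(a * r for a, r in zip(alpha, vector_reward))


def condition_on_weights(observation, alpha):
    """Assumed conditioning scheme: append alpha to the observation vector."""
    return list(observation) + list(alpha)


rng = random.Random(0)
alpha = sample_simplex_weights(2, rng)                       # one trade-off per episode
policy_input = condition_on_weights([0.0, 0.0, 0.0], alpha)  # what the policy would see
scalar_reward = scalarize([1.0, -0.5], alpha)                # what a scalar RL loss would use
```

Resampling α across episodes is what lets a single policy be trained over many trade-offs instead of training one policy per weight vector.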
Abstract
Multi-objective reinforcement learning (MORL) is essential for addressing the intricacies of real-world RL problems, which often require trade-offs between multiple utility functions. However, MORL is challenging because learning dynamics are unstable with deep learning-based function approximators. Most prior research has tried to overcome this issue by exploring different value-based loss functions for MORL. Our work instead empirically explores model-free policy learning loss functions and the impact of different architectural choices. We introduce two approaches: Multi-objective Proximal Policy Optimization (MOPPO), which extends PPO to MORL, and Multi-objective Advantage Actor Critic (MOA2C), which acts as a simple baseline in our ablations. Our proposed approach is straightforward to implement, requiring only small modifications at the level of the function approximator. We conduct comprehensive evaluations on the MORL Deep Sea Treasure, Minecart, and Reacher environments and show that MOPPO effectively captures the Pareto front. Our extensive ablation studies and empirical analyses reveal the impact of different architectural choices and underscore the robustness and versatility of MOPPO compared to popular MORL approaches such as Pareto Conditioned Networks (PCN) and Envelope Q-learning on MORL metrics, including hypervolume and expected utility.
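As an illustration of the hypervolume metric mentioned above, here is a minimal sketch for the two-objective maximization case. The function name, the sweep-based algorithm, and the choice of reference point are assumptions made for this example; the paper's own evaluation code may differ.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded below by `reference` (maximization)."""
    ref_x, ref_y = reference
    # Keep only points that strictly improve on the reference in both objectives.
    candidates = [(x, y) for x, y in points if x > ref_x and y > ref_y]
    # Sweep from the largest first objective to the smallest.
    candidates.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref_y
    for x, y in candidates:
        # Only a point that raises the second objective adds new area.
        if y > best_y:
            area += (x - ref_x) * (y - best_y)
            best_y = y
    return area


front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
volume = hypervolume_2d(front, reference=(0.0, 0.0))
```

A larger value means the discovered returns dominate more of the objective space, which is why the metric rewards both quality and coverage of the Pareto front. Dominated points contribute nothing, so adding them to `front` leaves the result unchanged.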
