Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization
Tanmay Ambadkar, Sourav Panda, Shreyash Kale, Jonathan Dodge, Abhinav Verma
TL;DR
This work addresses multi-objective reinforcement learning (MORL), where conflicting objectives make single-policy optimization fragile because of gradient interference and mode collapse. It presents $D^3PO$, a PPO-based framework that preserves per-objective learning signals through a multi-head critic $V^{(i)}(s,\omega)$, computes a PPO surrogate for each objective, and integrates preferences only after stabilization (Late-Stage Weighting). A scaled diversity regularizer complements this by mapping distinct preferences to distinct behaviors. The key contributions are a formal analysis showing the advantages of Late-Stage Weighting over early scalarization, a diversity regularizer that prevents collapse, and extensive experiments on high-dimensional and many-objective control tasks that demonstrate broader, higher-quality Pareto fronts from a single policy, with favorable hypervolume (HV) and expected utility (EU). $D^3PO$ achieves competitive or superior front coverage with substantially lower memory and deployment complexity than multi-policy baselines, enabling scalable, preference-conditioned MORL in real-world domains. These results advance practical MORL by providing a robust, theory-grounded single-policy approach that preserves per-objective signals and encourages diverse responses across the preference space.
Abstract
Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
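The contrast between premature scalarization and integrating preferences after stabilization can be illustrated with a small numerical sketch. It is an interpretation of the idea under stated assumptions, not the paper's implementation: the function names, the clipping range `eps = 0.2`, and the choice to treat PPO clipping as the stabilization step are all hypothetical, and the diversity regularizer is not covered.

```python
import numpy as np


def clipped_surrogate(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate, computed elementwise."""
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def early_scalarization_objective(ratio, adv_per_obj, omega, eps=0.2):
    """Collapse per-objective advantages with the preference first, then clip.

    ratio:       (batch,)   probability ratios pi_new / pi_old
    adv_per_obj: (batch, k) one advantage estimate per objective
    omega:       (k,)       preference weights
    """
    scalar_adv = adv_per_obj @ omega  # conflicting signs can cancel here
    return clipped_surrogate(ratio, scalar_adv, eps).mean()


def late_weighting_objective(ratio, adv_per_obj, omega, eps=0.2):
    """Clip each objective's surrogate separately, then apply the preference.

    Assumption: this is one plausible reading of "integrates preferences
    only after stabilization"; the paper may define it differently.
    """
    per_obj = clipped_surrogate(ratio[:, None], adv_per_obj, eps)  # (batch, k)
    return (per_obj @ omega).mean()


# One sample with two objectives whose advantages conflict exactly.
ratio = np.array([1.5])
adv_per_obj = np.array([[1.0, -1.0]])
omega = np.array([0.5, 0.5])

early = early_scalarization_objective(ratio, adv_per_obj, omega)
late = late_weighting_objective(ratio, adv_per_obj, omega)
print(early, late)
```

In this toy case the early-scalarized objective is exactly 0 because the two advantages cancel before clipping, so the sample carries no learning signal. With late weighting, the first objective's surrogate is clipped to 1.2 while the second stays at -1.5, giving roughly -0.15: each objective's signal survives until the preference is applied.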
