Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization

Tanmay Ambadkar; Sourav Panda; Shreyash Kale; Jonathan Dodge; Abhinav Verma

TL;DR

This work addresses multi-objective reinforcement learning (MORL) where conflicting objectives make single-policy optimization fragile due to gradient interference and mode collapse. It presents D3PO, a PPO-based framework that preserves per-objective learning signals via a multi-head critic $V^{(i)}(s,\omega)$, applies PPO surrogates per objective, and integrates preferences only after stabilization through Late-Stage Weighting, complemented by a scaled diversity regularizer to map distinct preferences to distinct behaviors. The key contributions include formal analysis showing advantages of Late-Stage Weighting over early scalarization, a diversity regularizer that prevents collapse, and extensive experiments on high-dimensional and many-objective control tasks demonstrating broader, higher-quality Pareto fronts with a single policy and favorable HV and EU metrics. D3PO achieves competitive or superior front coverage with substantially lower memory and deployment complexity than multi-policy baselines, enabling scalable, preference-conditioned MORL in real-world domains. These results advance practical MORL by providing a robust, theory-grounded single-policy approach that preserves per-objective signals and encourages diverse responses across the preference space.

Abstract

Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
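
To make the contrast between premature scalarization and late-stage preference integration concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard PPO clipped surrogate, a clip range `eps=0.2`, and a batch of per-objective advantages `advs` of shape `(T, d)`; the function names are hypothetical. Early scalarization collapses the advantages with $\omega$ before clipping, while late-stage weighting clips each objective's surrogate separately and applies $\omega$ afterwards.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate, elementwise (a quantity to maximize)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def early_scalarization(ratio, advs, omega, eps=0.2):
    """Scalarize the advantages with omega first, then clip once."""
    a_omega = advs @ omega                       # shape (T,)
    return clipped_surrogate(ratio, a_omega, eps).mean()

def late_stage_weighting(ratio, advs, omega, eps=0.2):
    """Clip each objective's surrogate separately, then weight by omega."""
    per_objective = clipped_surrogate(ratio[:, None], advs, eps)  # shape (T, d)
    return (per_objective @ omega).mean()

# Illustrative batch (assumed values): two timesteps, two conflicting objectives.
ratio = np.array([1.5, 0.6])                     # pi_new / pi_old per timestep
advs = np.array([[2.0, -1.0], [-1.0, 2.0]])      # per-objective advantages
omega = np.array([0.5, 0.5])                     # preference weights

print(early_scalarization(ratio, advs, omega))
print(late_stage_weighting(ratio, advs, omega))
```

When every probability ratio lies inside the clip range, the clipped surrogate is linear in the advantage and the two objectives coincide; they differ only once clipping becomes active on at least one objective.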

Paper Structure (47 sections, 7 theorems, 30 equations, 3 figures, 9 tables)

Introduction
Related Work
Preliminaries
Method
Innovations
Per-Objective Advantage and Value Estimation
Policy Optimization with Decomposed Gradients and Diversity Regularization
Analysis of the D3PO Framework
Experiments
Conclusion
D3PO Pseudocode
Metrics Definitions
Discrete Environments Results
Ablation Experiments
Theoretical Analysis of Multi-Objective PPO Formulations
...and 32 more sections

Key Result

Lemma 5.1

Let $A_t^\omega:=\omega^\top\mathbf{A}_t$ and $M_{LSW}:=\sum_{i=1}^d \omega_i |A_t^{(i)}|$. Then $|A_t^\omega| \le M_{LSW}$, with strict inequality whenever there exist $i,j$ with $A_t^{(i)}A_t^{(j)}<0$ and $\omega_i,\omega_j>0$.
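
The two quantities in Lemma 5.1 can be compared numerically. The sketch below is only an illustration with assumed values for $\omega$ and $\mathbf{A}_t$, and the helper names are hypothetical. It relies on the triangle inequality for non-negative weights: the magnitude of the scalarized advantage $|\omega^\top\mathbf{A}_t|$ cannot exceed $M_{LSW}$, and it falls strictly below it when two positively weighted objectives have advantages of opposite sign.

```python
import numpy as np

def scalarized_magnitude(omega, adv):
    """|omega^T A_t|: signal magnitude after scalarizing the advantages."""
    return abs(float(omega @ adv))

def lsw_magnitude(omega, adv):
    """M_LSW = sum_i omega_i * |A_t^(i)|: magnitude kept per objective."""
    return float(omega @ np.abs(adv))

omega = np.array([0.5, 0.5])          # assumed preference weights
conflicting = np.array([2.0, -1.0])   # advantages with opposite signs
aligned = np.array([2.0, 1.0])        # advantages with the same sign

for name, adv in [("conflicting", conflicting), ("aligned", aligned)]:
    print(name, scalarized_magnitude(omega, adv), lsw_magnitude(omega, adv))
```

With conflicting signs part of the signal cancels under scalarization, while with aligned signs the two magnitudes agree.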

Figures (3)

Figure 1: Overview of the $D^{3}PO$ framework. The architecture decouples credit assignment from preference integration to prevent gradient interference. (1) Multi-Head Critic: The critic estimates independent per-objective values $V^{(i)}(s,\omega)$ to compute unweighted advantages $A^{(i)}$. (2) PPO Surrogate Losses: The clipping mechanism is applied to each advantage stream independently (Eq. \ref{eq:ppo_loss}), stabilizing the learning signal before scalarization. (3) Late-Stage Weighting: Preference weights $\omega$ are applied only to the stabilized surrogate losses (Eq. \ref{eq:final_loss}), ensuring gradients are not cancelled prior to optimization. (4) Diversity Regularizer: A diversity loss (Eq. \ref{eq:diversity_loss}) is added to force behavioral separation between different preference queries, preventing mode collapse.
Figure 2: Pareto front comparison on two-objective MO-MuJoCo benchmarks. D3PO (red) discovers a uniform and well-distributed front across the trade-off space, whereas C-MORL (blue) refines extreme points at the cost of higher sparsity. Compared to CAPQL, GPI-LS, and PG-MORL, D3PO achieves broader coverage and reduced collapse, particularly visible in Ant and Humanoid.
Figure 3: Reward curves for different objectives and overall discounted return across environments.
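
The exact form of the diversity loss named in the Figure 1 caption is not reproduced in this summary, so the following is only one plausible shape for such a regularizer, stated as an assumption rather than the paper's equation. It assumes the policy exposes an action mean per state and preference, and it uses a hypothetical hinge penalty that vanishes once the gap between the action means for two preferences reaches a scale factor `alpha` times the gap between the preferences themselves.

```python
import numpy as np

def diversity_penalty(mu_a, mu_b, omega_a, omega_b, alpha=1.0):
    """Hypothetical hinge penalty (an assumed form, not the paper's loss).

    mu_a, mu_b:       action means for the same states under two preferences, shape (B, act_dim)
    omega_a, omega_b: the two preference vectors per state, shape (B, d)
    Returns the batch mean of max(0, alpha * ||omega_a - omega_b|| - ||mu_a - mu_b||).
    """
    behavior_gap = np.linalg.norm(mu_a - mu_b, axis=-1)
    preference_gap = np.linalg.norm(omega_a - omega_b, axis=-1)
    return np.maximum(0.0, alpha * preference_gap - behavior_gap).mean()

# Assumed toy batch: three states, two objectives, two action dimensions.
omega_a = np.tile([1.0, 0.0], (3, 1))
omega_b = np.tile([0.0, 1.0], (3, 1))

collapsed = diversity_penalty(np.zeros((3, 2)), np.zeros((3, 2)), omega_a, omega_b)
separated = diversity_penalty(np.tile([2.0, 0.0], (3, 1)),
                              np.tile([0.0, 2.0], (3, 1)), omega_a, omega_b)
print(collapsed, separated)
```

Under this assumed form, a policy that ignores the preference input pays a penalty proportional to the preference gap, while a policy whose behavior already separates pays none.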

Theorems & Definitions (21)

Definition 3.1: Pareto Dominance
Definition 3.2: Pareto-Optimal Policy
Definition 2.1: Hypervolume Indicator
Definition 2.2: Sparsity Indicator
Definition 2.3: Expected Utility
Definition 2.4: Compute Time
Lemma 5.1: ES magnitude loss
proof
Proposition 5.2: Conditional equivalence of MVS and LSW under homogeneous surrogate
proof : Proof sketch
...and 11 more
