MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning

Yifu Yuan; Zhenrui Zheng; Zibin Dong; Jianye Hao

TL;DR

MODULI is a diffusion-based planning framework for offline MORL that conditions trajectory generation on user preferences and multi-objective returns. It introduces two return-normalization methods and a sliding guidance mechanism, implemented as a trained slider adapter, to improve generalization to out-of-distribution (OOD) preferences. In experiments on the D4MORL benchmark, MODULI approximates the Pareto front more closely than state-of-the-art offline MORL baselines, generalizes better to OOD preferences (lower return deviation, RD), and produces denser fronts. The approach enables preference-aware planning from offline data alone, with implications for multi-objective robotic control and decision making.

Abstract

Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed and fail to cover the full range of preferences, which gives rise to out-of-distribution (OOD) preference areas. Existing offline MORL algorithms generalize poorly to OOD preferences, resulting in policies that do not align with the given preferences. Leveraging the expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and to derive actions for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. With the slider, MODULI transitions from in-distribution (ID) preferences to OOD preferences, patching and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art offline MORL baselines and generalizes well to OOD preferences.
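To make the idea of an OOD preference area concrete, the following is a minimal sketch and not the paper's procedure. It assumes two objectives with preferences written as (w1, 1 - w1), and it flags a query preference as OOD when no dataset preference lies within an L1 distance `tol`. The function name `is_ood_preference`, the threshold `tol`, and the example range of the narrow dataset are all hypothetical choices made for illustration.

```python
import numpy as np

def is_ood_preference(query, dataset_prefs, tol=0.05):
    """Return True if no dataset preference is within L1 distance `tol` of `query`.

    `tol` is a hypothetical threshold chosen for illustration only.
    """
    query = np.asarray(query, dtype=float)
    dataset_prefs = np.asarray(dataset_prefs, dtype=float)
    # L1 distance from the query to every preference seen in the dataset
    dists = np.abs(dataset_prefs - query).sum(axis=1)
    return bool(dists.min() > tol)

# Illustrative narrow dataset: w1 only covers [0.4, 0.6] (assumed range)
w1 = np.linspace(0.4, 0.6, 21)
narrow_prefs = np.stack([w1, 1.0 - w1], axis=1)

print(is_ood_preference([0.5, 0.5], narrow_prefs))  # preference covered by the data
print(is_ood_preference([0.9, 0.1], narrow_prefs))  # preference far from the data
```

Under these assumptions, a preference inside the covered band is treated as in-distribution, while one near the edge of the simplex is treated as OOD.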

Paper Structure (41 sections, 21 equations, 7 figures, 7 tables, 3 algorithms)

Introduction
Related Work
Offline MORL
Diffusion Models for Decision Making
Preliminaries
Multi-objective RL
Denoising Diffusion Implicit Models (DDIM)
Guided Sampling Methods
Methodology
Offline MORL as Conditional Generative Planning
Training
Refining Guidance via Preference-Return Normalization Methods
Global Normalization
Preference Predicted Normalization
Neighborhood Preference Normalization
...and 26 more sections

Figures (7)

Figure 1: (Left) Trajectory returns for Hopper-Amateur datasets in D4MORL. We visualize Complete, Shattered, and Narrow datasets for comparison. OOD preference regions are marked by the red circles. (Right) The approximated Pareto fronts learned by MORvS and MODULI on the Shattered dataset.
Figure 2: Approximated Pareto fronts on Complete dataset by MODULI. Undominated / Dominated solutions are colored in red / blue, and dataset trajectories are colored in grey.
Figure 3: Solution (Blue points) under OOD preference for different algorithms in Walker2d-expert-Shattered (Top) and Hopper-amateur-Narrow (Bottom). Dataset trajectories are colored in grey.
Figure 4: Comparison of HV, SP, and RD performance of MODULI with and without slider across various environments. All metrics are rescaled by dividing by their maximum values. The suffix "-e" denotes Expert and "-a" denotes Amateur.
Figure 5: The Pareto front illustrations for MODULI with and without the Slider, generated on the Hopper-Amateur-Narrow dataset, demonstrate that the inclusion of the Slider significantly extends the boundary of the policies.
...and 2 more figures

Theorems & Definitions (3)

Definition 1.1: Hypervolume (HV)
Definition 1.2: Sparsity (SP)
Definition 1.3: Return Deviation (RD)
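
The first two metrics can be sketched in a few lines. This is a hedged illustration rather than the paper's evaluation code: it assumes two objectives, maximization, and the definitions of hypervolume and sparsity commonly used in the MORL literature (the area dominated by the front above a reference point, and the mean squared gap between neighboring front values per objective). The function names are hypothetical, and Return Deviation is left out because its exact form is not given in this summary.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated points (maximization), sorted by the first objective."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some point is >= in every objective and > in at least one
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    front = pts[keep]
    return front[np.argsort(front[:, 0])]

def hypervolume_2d(points, ref):
    """Area dominated by the 2-D front and bounded below by the reference point `ref`.

    Assumes every front point is at least as good as `ref` in both objectives.
    """
    front = pareto_front(points)
    hv, prev_y = 0.0, float(ref[1])
    # Sweep from the largest first objective down; the second objective grows along the way
    for x, y in front[::-1]:
        hv += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return float(hv)

def sparsity(points):
    """Mean squared gap between consecutive sorted front values, summed over objectives."""
    front = pareto_front(points)
    if len(front) < 2:
        return 0.0
    gaps = np.diff(np.sort(front, axis=0), axis=0)
    return float((gaps ** 2).sum() / (len(front) - 1))

returns = [[1, 3], [2, 2], [3, 1], [1, 1]]  # toy 2-objective returns
print(pareto_front(returns))
print(hypervolume_2d(returns, ref=[0, 0]))
print(sparsity(returns))
```

Higher hypervolume indicates a better front, while lower sparsity indicates a denser one. In the toy data, the point (1, 1) is dominated and does not contribute to either metric.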
