Policy-regularized Offline Multi-objective Reinforcement Learning

Qian Lin; Chao Yu; Zongkai Liu; Zifan Wu

TL;DR

This work tackles offline multi-objective reinforcement learning by extending offline policy-regularized RL to linear preference settings, addressing the preference-inconsistent demonstration (PrefID) problem through behavior-preference filtering and high-expressiveness regularization. It introduces a preference-conditioned scalarized update to learn multiple Pareto policies with a single network and proposes Regularization Weight Adaptation to adapt regularization strength to target preferences during deployment. The approach demonstrates competitive performance on the D4MORL and MOSB offline MORL benchmarks, outperforming baselines in several metrics, and shows robust handling of out-of-distribution preferences. Overall, the method broadens the applicability of offline MORL by reducing data requirements, enabling dynamic preference handling, and expanding the learned Pareto front.

Abstract

In this paper, we aim to train a policy for multi-objective RL using only offline trajectory data. To this end, we extend the offline policy-regularized method, a widely adopted approach for single-objective offline RL problems, to the multi-objective setting. However, such methods face a new challenge in offline MORL settings, namely the preference-inconsistent demonstration problem. We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations by approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness. Moreover, we integrate the preference-conditioned scalarized update method into policy-regularized offline RL to simultaneously learn a set of policies using a single policy network, thus reducing the computational cost of training a large number of individual policies for various preferences. Finally, we introduce Regularization Weight Adaptation to dynamically determine appropriate regularization weights for arbitrary target preferences during deployment. Empirical results on various multi-objective datasets demonstrate the capability of our approach in solving offline MORL problems.
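To make the two ideas of linear scalarization and preference-based filtering concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the L1-normalized-return approximation of the behavior preference, and the L1 distance threshold are all hypothetical choices, since the abstract does not specify how behavior preferences are approximated.

```python
import numpy as np


def scalarize(vector_rewards, preference):
    """Linear scalarization: utility = preference . reward (one value per row)."""
    rewards = np.asarray(vector_rewards, dtype=float)
    return rewards @ np.asarray(preference, dtype=float)


def approx_behavior_preference(vector_return):
    """ASSUMPTION (not from the source): approximate the behavior preference
    of a trajectory by its non-negative vector return normalized to sum to 1."""
    ret = np.clip(np.asarray(vector_return, dtype=float), 0.0, None)
    total = ret.sum()
    if total == 0.0:
        # No signal in the return: fall back to a uniform preference.
        return np.full(ret.shape, 1.0 / ret.size)
    return ret / total


def filter_trajectories(vector_returns, target_preference, threshold):
    """Return indices of trajectories whose approximated behavior preference
    lies within an L1 distance `threshold` (hypothetical) of the target."""
    target = np.asarray(target_preference, dtype=float)
    kept = []
    for index, ret in enumerate(vector_returns):
        distance = np.abs(approx_behavior_preference(ret) - target).sum()
        if distance <= threshold:
            kept.append(index)
    return kept


if __name__ == "__main__":
    returns = [[9.0, 1.0], [5.0, 5.0], [1.0, 9.0]]
    print(filter_trajectories(returns, [0.8, 0.2], threshold=0.4))
    print(scalarize([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5]))
```

In this toy setup, only the trajectory whose return leans heavily toward the first objective survives the filter for a target preference of (0.8, 0.2); the other two would act as preference-inconsistent demonstrations for that target.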

Paper Structure (21 sections, 16 equations, 7 figures, 6 tables, 2 algorithms)

Introduction
Related work
Background
Policy-regularized Offline MORL
The PrefID Problem
Mitigating the PrefID Problem
Responding to Arbitrary Target Preferences
Experiments
Experiments on D4MORL
Experiments on MOSB
Conclusion
Implementation Details of Policy-regularized Offline MORL
Implementation Details of Baselines
Evaluation Metrics
Details of Environments and Datasets
...and 6 more sections

Figures (7)

Figure 1: Performance of policy-regularized methods on the Ant amateur and Hopper expert datasets within D4MORL. Two solid circles of the same color represent the expected vector returns of policies trained under preferences ${\bm{\omega}}_1$ and ${\bm{\omega}}_2$, respectively (the farther the two circles extend toward high values on the axes, the better the performance). Each cross marks a training run that was aborted because the value estimates diverged. The black dots represent the trajectories of the entire offline dataset, which reflect the performance of the behavior policies under various preferences.
Figure 2: Approximate Pareto fronts learned by our approach with Diffusion regularization on the Amateur datasets.
Figure 3: Vector returns of trajectories in the MOSB dataset and approximate Pareto fronts learned with Diffusion regularization.
Figure 4: Illustration of the hypervolume and sparsity metrics.
Figure 5: Relationship between the adapted regularization weight and the offline data distribution. The orange line represents the data density under different behavior preferences in the offline dataset. The heatmap indicates the utility under various target preferences and different regularization weights.
...and 2 more figures

Theorems & Definitions (2)

Definition 1: Hypervolume
Definition 2: Sparsity
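The formulas behind these two definitions are not reproduced on this page, so the following Python sketch relies on a convention common in the MORL literature and should be read as an assumption rather than the paper's exact definitions: hypervolume is the area dominated by the Pareto front relative to a reference point (two objectives only, maximization), and sparsity is the sum over objectives of squared gaps between adjacent sorted front values, divided by the number of front points minus one. The function names and the reference point are hypothetical, and a non-empty set of points is assumed.

```python
import numpy as np


def pareto_front(points):
    """Return the non-dominated points (maximization), duplicates removed."""
    pts = np.unique(np.asarray(points, dtype=float), axis=0)
    keep = []
    for i, p in enumerate(pts):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(pts)
            if j != i
        )
        if not dominated:
            keep.append(p)
    return np.array(keep).reshape(-1, pts.shape[1])


def hypervolume_2d(points, reference):
    """Area dominated by the front and bounded below by `reference` (2 objectives)."""
    ref = np.asarray(reference, dtype=float)
    front = pareto_front(points)
    front = front[np.all(front > ref, axis=1)]  # ignore points not above the reference
    if len(front) == 0:
        return 0.0
    # Sort by the first objective, descending; the second is then ascending.
    front = front[np.argsort(-front[:, 0])]
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return float(area)


def sparsity(points):
    """Sum of squared gaps between adjacent sorted front values per objective,
    divided by (number of front points - 1); 0.0 for a single-point front."""
    front = pareto_front(points)
    if len(front) < 2:
        return 0.0
    total = 0.0
    for j in range(front.shape[1]):
        values = np.sort(front[:, j])
        total += np.sum(np.diff(values) ** 2)
    return float(total / (len(front) - 1))


if __name__ == "__main__":
    returns = [[3.0, 1.0], [2.0, 2.0], [1.0, 3.0], [1.0, 1.0]]
    print(hypervolume_2d(returns, reference=[0.0, 0.0]))
    print(sparsity(returns))
```

Under this convention, a larger hypervolume indicates a front that covers more of the objective space, while a smaller sparsity indicates a denser front; the two are typically reported together because either one alone can be misleading.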
