Policy-regularized Offline Multi-objective Reinforcement Learning
Qian Lin, Chao Yu, Zongkai Liu, Zifan Wu
TL;DR
This work tackles offline multi-objective reinforcement learning by extending offline policy-regularized RL to linear preference settings, addressing the preference-inconsistent demonstration (PrefID) problem through behavior-preference filtering and high-expressiveness regularization. It introduces a preference-conditioned scalarized update to learn multiple Pareto policies with a single network and proposes Regularization Weight Adaptation to adapt regularization strength to target preferences during deployment. The approach demonstrates competitive performance on the D4MORL and MOSB offline MORL benchmarks, outperforming baselines in several metrics, and shows robust handling of out-of-distribution preferences. Overall, the method broadens the applicability of offline MORL by reducing data requirements, enabling dynamic preference handling, and expanding the learned Pareto front.
Abstract
In this paper, we aim to train a policy for multi-objective reinforcement learning (MORL) using only offline trajectory data. To this end, we extend the offline policy-regularized method, a widely adopted approach for single-objective offline RL problems, to the multi-objective setting. However, such methods face a new challenge in offline MORL settings, namely the preference-inconsistent demonstration (PrefID) problem. We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations by approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness. Moreover, we integrate the preference-conditioned scalarized update method into policy-regularized offline RL so that a single policy network simultaneously learns a set of policies, which avoids the computational cost of training a separate policy for each preference. Finally, we introduce Regularization Weight Adaptation to dynamically determine appropriate regularization weights for arbitrary target preferences during deployment. Empirical results on various multi-objective datasets demonstrate the capability of our approach in solving offline MORL problems.
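To make the linear-preference setting and the filtering idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a trajectory's behavior preference can be approximated by normalizing its non-negative per-objective returns, and that consistency is measured by an L1 distance against a threshold. The function names, the normalization proxy, and the `max_l1_gap` parameter are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def scalarize(returns: Sequence[float], preference: Sequence[float]) -> float:
    """Linear scalarization: preference-weighted sum of per-objective returns."""
    return sum(w * r for w, r in zip(preference, returns))


def approx_behavior_preference(returns: Sequence[float]) -> List[float]:
    """Hypothetical proxy: normalize non-negative returns so they sum to 1.

    Falls back to a uniform preference when the total return is not positive.
    """
    total = sum(returns)
    if total <= 0:
        n = len(returns)
        return [1.0 / n] * n
    return [r / total for r in returns]


def filter_consistent(
    trajectory_returns: Sequence[Sequence[float]],
    target_preference: Sequence[float],
    max_l1_gap: float,
) -> List[Sequence[float]]:
    """Keep trajectories whose approximated preference lies within
    max_l1_gap (L1 distance) of the target preference."""
    kept = []
    for returns in trajectory_returns:
        behavior_pref = approx_behavior_preference(returns)
        gap = sum(abs(b - t) for b, t in zip(behavior_pref, target_preference))
        if gap <= max_l1_gap:
            kept.append(returns)
    return kept


# Toy data: per-objective returns of three trajectories (two objectives).
data = [(8.0, 2.0), (5.0, 5.0), (1.0, 9.0)]
kept = filter_consistent(data, target_preference=(0.8, 0.2), max_l1_gap=0.4)
```

In this toy example only the first trajectory survives the filter, since the other two were implicitly collected under preferences far from the target (0.8, 0.2). A stricter or looser threshold trades data volume against preference consistency.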
