Learning Control Policies for Variable Objectives from Offline Data
Marc Weber, Phillip Swazinna, Daniel Hein, Steffen Udluft, Volkmar Sterzing
TL;DR
This work tackles offline reinforcement learning when operator preferences change over time. It introduces a variable objective policy (VOP), which takes differentiable reward-objective parameters $\Omega$ as an additional input and is learned from a fixed offline dataset. A model-based offline RL framework with an ensemble of dynamics models generates virtual rollouts $\mathcal{T}^\pi(s_0, \Omega)$, so the policy can be optimized by gradient-based search to maximize the expected return $\mathbb{E}[\hat{R}]$ across values of $\Omega$. The key contributions are a dedicated offline policy-search method that supports continuous objective parameters, demonstrations on cart-pole upswing and the industrial benchmark showing generalization across $\Omega$, and evidence that policies transfer from the learned dynamics to the real environment. The approach offers data-efficient, runtime-flexible control suited to safety-critical and industrial settings, where operators can rebalance objectives without retraining.
Abstract
Offline reinforcement learning provides a viable approach to obtaining advanced control strategies for dynamical systems, in particular when direct interaction with the environment is not available. In this paper, we introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP). With this approach, policies are trained to generalize efficiently over a variety of objectives, which parameterize the reward function. We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or rebalance optimization targets at runtime, without the need to collect additional observation batches or to retrain.
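
To make the conditioning idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear policy, linear stand-in dynamics models in place of models fitted to the offline dataset, and a hypothetical reward in which $\Omega$ weights a state-error term against a control-effort term. All names (`reward`, `policy`, `virtual_return`) are illustrative. It shows only how a virtual rollout return can be estimated over an ensemble for a given $\Omega$; the gradient-based policy update is omitted.

```python
import numpy as np

def reward(state, action, omega):
    # Hypothetical reward: omega[0] weights state error, omega[1] weights control effort.
    return -(omega[0] * np.sum(state ** 2) + omega[1] * np.sum(action ** 2))

def policy(state, omega, theta):
    # Linear policy acting on the state concatenated with the objective parameters.
    return np.tanh(theta @ np.concatenate([state, omega]))

def virtual_return(s0, omega, theta, ensemble, horizon=20):
    # Roll out the policy in every ensemble member and average the returns.
    returns = []
    for A, B in ensemble:
        s, total = s0.copy(), 0.0
        for _ in range(horizon):
            a = policy(s, omega, theta)
            total += reward(s, a, omega)
            s = A @ s + B @ a  # assumed linear transition model
        returns.append(total)
    return float(np.mean(returns))

rng = np.random.default_rng(0)
state_dim, action_dim, omega_dim = 2, 1, 2
# Stand-in for dynamics models learned from offline data (assumed linear here).
ensemble = [
    (0.9 * np.eye(state_dim) + 0.01 * rng.standard_normal((state_dim, state_dim)),
     0.1 * rng.standard_normal((state_dim, action_dim)))
    for _ in range(4)
]
theta = 0.1 * rng.standard_normal((action_dim, state_dim + omega_dim))
s0 = np.array([1.0, -0.5])

# The same policy parameters are evaluated under two different objective settings.
for omega in (np.array([1.0, 0.0]), np.array([0.5, 0.5])):
    print(omega, virtual_return(s0, omega, theta, ensemble))
```

Because $\Omega$ enters both the policy input and the reward, a single set of parameters `theta` can be scored, and in a full method trained, across many objective settings; changing `omega` at runtime then changes the action without touching `theta`.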
