Learning Control Policies for Variable Objectives from Offline Data

Marc Weber, Phillip Swazinna, Daniel Hein, Steffen Udluft, Volkmar Sterzing

TL;DR

This work tackles offline reinforcement learning when operator preferences change over time. It introduces a variable objective policy (VOP), which is conditioned on differentiable reward-objective parameters $\Omega$ and learned from a fixed offline dataset. A model-based offline RL framework with an ensemble of dynamics models generates virtual rollouts $\mathcal{T}^\pi(s_0, \Omega)$, so the policy can be optimized by gradient-based search to maximize the expected return $\mathbb{E}[\hat{R}]$ while conditioning on $\Omega$. The key contributions are a dedicated offline policy-search method that supports continuous objective parameters, a demonstration on the cart-pole upswing task and the industrial benchmark showing generalization across $\Omega$, and evidence that policies transfer from the learned dynamics to the real environment. The approach offers data-efficient, runtime-flexible control suited to safety-critical and industrial settings, letting operators rebalance objectives without retraining.
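
Read as an optimization problem, the summary above can be written roughly as follows. This is a sketch in assumed notation, not the paper's own formulation: the policy parameters $\theta$, the sampling distributions $p(s_0)$ and $p(\Omega)$, the horizon $T$, the reward symbol $r$, and the omission of any discount factor are all assumptions introduced here for illustration.

$$
\max_{\theta}\; \mathbb{E}_{s_0 \sim p(s_0),\; \Omega \sim p(\Omega)}\big[\hat{R}\big], \qquad \hat{R} = \sum_{t=0}^{T-1} r(\hat{s}_{t+1}, a_t; \Omega), \qquad a_t = \pi_\theta(\hat{s}_t, \Omega)
$$

Here the states $\hat{s}_t$ come from the virtual rollout $\mathcal{T}^{\pi}(s_0, \Omega)$ produced by the learned dynamics models, and the same $\Omega$ enters both the policy input and the reward.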

Abstract

Offline reinforcement learning provides a viable approach to obtain advanced control strategies for dynamical systems, in particular when direct interaction with the environment is not available. In this paper, we introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP). With this approach, policies are trained to generalize efficiently over a variety of objectives, which parameterize the reward function. We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime, without need for collecting additional observation batches or re-training.
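
The idea of passing the objective to the policy and training on virtual rollouts can be illustrated with a small NumPy sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the "ensemble" is a set of randomly perturbed linear models instead of models learned from offline data, the reward is an assumed quadratic distance to a scalar target `omega`, averaging the return over ensemble members is an assumption, and a finite-difference gradient with step halving replaces backpropagation through the models. All function names are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ensemble(n_models=3):
    # Hypothetical stand-in for learned dynamics: perturbed linear models s' = A s + B a.
    models = []
    for _ in range(n_models):
        A = np.array([[1.0, 0.1], [0.0, 1.0]]) + 0.01 * rng.standard_normal((2, 2))
        B = np.array([0.0, 0.1]) + 0.01 * rng.standard_normal(2)
        models.append((A, B))
    return models

def reward(state, omega):
    # Assumed objective-parameterized reward: penalize distance to target position omega.
    return -(state[0] - omega) ** 2

def policy(params, state, omega):
    # The objective parameter is part of the policy input (linear policy, tanh-bounded action).
    features = np.concatenate([state, [omega, 1.0]])
    return np.tanh(params @ features)

def virtual_return(params, models, s0, omega, horizon=20):
    # Roll the policy out in every ensemble member and average the summed rewards.
    total = 0.0
    for A, B in models:
        s = s0.copy()
        for _ in range(horizon):
            a = policy(params, s, omega)
            s = A @ s + B * a
            total += reward(s, omega)
    return total / len(models)

def objective(params, models, batch):
    # Mean virtual return over sampled (start state, objective) pairs.
    return np.mean([virtual_return(params, models, s0, om) for s0, om in batch])

def fd_gradient(params, models, batch, eps=1e-4):
    # Central finite differences; a stand-in for gradients obtained by backpropagation.
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (objective(params + step, models, batch)
                   - objective(params - step, models, batch)) / (2 * eps)
    return grad

def train(params, models, batch, iterations=20, lr=0.1):
    # Gradient ascent with step halving: a step is accepted only if the objective improves.
    best = objective(params, models, batch)
    for _ in range(iterations):
        grad = fd_gradient(params, models, batch)
        step = lr
        while step > 1e-8:
            candidate = params + step * grad
            value = objective(candidate, models, batch)
            if value > best:
                params, best = candidate, value
                break
            step *= 0.5
    return params

def sample_batch(n):
    # Random start states and random objectives, both drawn uniformly from [-1, 1].
    return [(rng.uniform(-1, 1, size=2), rng.uniform(-1, 1)) for _ in range(n)]

models = make_ensemble()
batch = sample_batch(6)
theta0 = np.zeros(4)
theta = train(theta0, models, batch)
print("return before:", objective(theta0, models, batch))
print("return after: ", objective(theta, models, batch))
```

After training, the same parameters `theta` can be queried with different targets, for example `policy(theta, state, 0.5)` versus `policy(theta, state, -0.5)`, which mirrors adjusting the objective at runtime without retraining.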

Paper Structure (22 sections, 10 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Illustration of the data and model workflow architecture (a), and the structure of the training algorithm for the VOP (b).
  • Figure 2: Schematic drawing of the cart-pole upswing and balancing setup with our VOP extensions highlighted in red and purple. The two variable objective parameters, $\Omega_x$ and $\Omega_{\theta}$, encode continuous target settings for the $x$-axis position and the upswing behavior, respectively.
  • Figure 3: Return improvement as a function of VOP training epochs for the cart-pole upswing benchmark. Shaded areas show the full width of the distributions resulting from random sampling of initial conditions and objectives.
  • Figure 4: Comparison of returns achieved after $250$ simulated steps by the trained VOP (blue) at selected objectives, $\Omega_x$ and $\Omega_\theta$, and individual policies trained exclusively for each objective value pair (red). Error bars indicate the full range of returns for a fixed set of $100$ random start positions of the cart and the pole.
  • Figure 5: Evaluation of our trained VOP for $\Omega_{\theta}\!=\!1$ and $\Omega_{\theta}\!=\!3$. In the beginning the cart is positioned at the leftmost corner with the pole pointing downwards. After $150$ steps the target position is suddenly changed to $\Omega_x\!=\!+1\,\mathrm{m}$. Depending on $\Omega_{\theta}$, the policy either moves to the new position aggressively, thereby dropping the pole in between positions (green), or more carefully, without dropping it (violet).
  • ...and 2 more figures