Offline Reinforcement Learning with Behavioral Supervisor Tuning

Padmanaba Srinivasan; William Knottenbelt

TL;DR

Offline RL methods that learn from static datasets often need extensive per-dataset hyperparameter tuning. The paper addresses this by introducing TD3-BST, which uses a Morse-network uncertainty model as a behavioral supervisor to dynamically weight regularization around dataset modes. By deriving a closed-form BST policy update and integrating it with TD3-BC-style actor-critic training, the method adapts the degree of behavioral cloning to the model's epistemic uncertainty, enabling effective learning from suboptimal data. On the D4RL suite, it reports state-of-the-art performance on locomotion benchmarks and leading results on Antmaze without per-dataset tuning, with further improvements when BST is applied to one-step methods such as IQL. The work demonstrates the practical benefits of combining uncertainty-based supervision with offline RL and points to future directions in alternative uncertainty measures and multi-source ensembles.

Abstract

Offline reinforcement learning (RL) algorithms are applied to learn performant, well-generalizing policies when provided with a static dataset of interactions. Many recent approaches to offline RL have seen substantial success, but with one key caveat: they demand substantial per-dataset hyperparameter tuning to achieve reported performance, which requires policy rollouts in the environment to evaluate; this can rapidly become cumbersome. Furthermore, substantial tuning requirements can hamper the adoption of these algorithms in practical domains. In this paper, we present TD3 with Behavioral Supervisor Tuning (TD3-BST), an algorithm that trains an uncertainty model and uses it to guide the policy to select actions within the dataset support. TD3-BST can learn more effective policies from offline datasets compared to previous methods and achieves the best performance across challenging benchmarks without requiring per-dataset tuning.

Paper Structure

This paper contains 29 sections, 2 theorems, 11 equations, 6 figures, 3 tables, and 1 algorithm.

Introduction
Contributions
Related Work
Offline Reinforcement Learning
Policy Constraint
Critic Regularization
Ensembles
Model-Based Uncertainty Estimation
Uncertainty Estimation
Preliminaries
Policy Constraint with a Behavioral Supervisor
Morse Networks for Offline RL
TD3-BST
Interpretation
Controlling the Tradeoff Constraint
...and 14 more sections

Key Result

Proposition 1

A Morse neural network can be expressed as an energy-based model: $E_{\phi} (x) = e^{-\log M_{\phi} (x)}$ where $M_{\phi}: \mathbb{R}^{d} \rightarrow \mathbb{R}$.
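
To make the quantity $M_{\phi}$ more concrete, the following is a minimal NumPy sketch of a Morse-style density. It rests on assumptions not stated on this page: an RBF kernel of the form $\exp(-\tfrac{\lambda}{2}\lVert z - t \rVert^2)$, an identity map in place of the learned network, and a hypothetical helper name `morse_density`. The target location simply echoes the mode at [0.8, 0.0] mentioned in the Figure 2 caption.

```python
import numpy as np

def morse_density(z, target, lam):
    """Assumed RBF-style Morse kernel: equals 1 at z == target, decays to 0 away from it."""
    diff = np.asarray(z, dtype=float) - np.asarray(target, dtype=float)
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-0.5 * lam * sq_dist)

# Hypothetical mode location and query points (illustrative values only).
target = np.array([0.8, 0.0])
points = np.array([[0.8, 0.0], [0.5, 0.0], [-0.8, 0.0]])

for lam in (1.0, 10.0):
    m = morse_density(points, target, lam)  # unnormalized density in [0, 1]
    neg_log_m = -np.log(m)                  # the -log M term from Proposition 1
    print(f"lambda={lam}: M={m}, -log M={neg_log_m}")
```

Under this assumed kernel, $-\log M_{\phi}(x)$ reduces to a scaled squared distance from the mode, and a larger $\lambda$ makes the density fall off more sharply, which matches the trend described for Figure 2.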

Figures (6)

Figure 1: An illustration of our method versus typical, TD3-BC-like actor-constraint methods. TD3-BC: a) A policy selecting an OOD action is constrained to select in-dataset actions. b) A policy selecting the optimal action may be penalized for not selecting an in-dataset, but not in-batch, inferior action. Our method: c) A policy selecting OOD actions is drawn towards in-dataset actions with decreasing constraint coefficient as it moves closer to any supported action. d) An optimal policy is not penalized for selecting an in-dataset action when the action is not contained in the current batch.
Figure 2: a-d: Contour plots of unnormalized densities $(\in [0, 1])$ produced by a Morse network for increasing $\lambda$ with ground truth actions included as $\times$ marks. e: Ground truth actions (blue) in the synthetic dataset, the MLE action (red). A Morse certainty weighted MLE model can select actions in a single mode, in this case, the (right-hand side) mode centred at [0.8, 0.0] (orange). Weighting a divergence constraint using a Morse (un)certainty will encourage the policy to select actions near the modes of $M_{\phi}$ that maximize reward.
Figure 3: $M_{\phi}$ densities and t-SNE of deviations for hopper-medium-expert (top row) and Antmaze-large-diverse (bottom row). Density plots are clipped at 10.0 as density for $\mathcal{D}$ is large. 10 actions are sampled from $\mathcal{D}_{\text{uni}}$ and $\mathcal{D}_{\text{perm}}$ each, per state.
Figure 4: Ablations of $\lambda$ on Antmaze datasets. Recall $k = \dim(\mathcal{A})$.
Figure 5: Histograms of deviation from dataset actions.
...and 1 more figure
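
The behavior described in Figure 1 (c), where the constraint coefficient shrinks as the policy's action approaches any supported action, can be illustrated with a small sketch. This is not the paper's closed-form BST update: the weight `1 - M` and the helpers `certainty` and `constraint_weight` are hypothetical choices made for illustration, and the certainty is approximated by the maximum RBF similarity to a handful of assumed dataset modes.

```python
import numpy as np

def certainty(action, modes, lam=10.0):
    """Hypothetical stand-in for M_phi: max RBF similarity to any supported action."""
    sq_dist = np.sum((modes - action) ** 2, axis=-1)
    return float(np.max(np.exp(-0.5 * lam * sq_dist)))

def constraint_weight(action, modes, lam=10.0):
    """Illustrative coefficient: near 1 for OOD actions, 0 at a supported action."""
    return 1.0 - certainty(action, modes, lam)

# Two assumed dataset modes and a policy action moving towards one of them.
modes = np.array([[0.8, 0.0], [-0.8, 0.0]])
path = [np.array([0.0, 1.0]), np.array([0.5, 0.3]), np.array([0.8, 0.0])]

weights = [constraint_weight(a, modes) for a in path]
for a, w in zip(path, weights):
    print(a, round(w, 4))
```

Because the weight depends on proximity to any mode rather than to the action sampled in the current batch, an action that sits on a supported mode receives no penalty in this sketch, in the spirit of Figure 1 (d).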

Theorems & Definitions (4)

Definition 1: Morse Kernel
Definition 2: Morse Neural Network
Proposition 1
Theorem 1
