Offline Reinforcement Learning with Behavioral Supervisor Tuning
Padmanaba Srinivasan, William Knottenbelt
TL;DR
The paper targets a practical obstacle in offline RL: algorithms that learn from static datasets typically need extensive per-dataset hyperparameter tuning. It introduces TD3-BST, which uses a Morse-network uncertainty model as a behavioral supervisor to dynamically weight regularization around dataset modes. By deriving a closed-form BST policy update and integrating it with TD3-BC-style actor-critic training, the method adapts the degree of behavioral cloning based on epistemic uncertainty, enabling effective learning from suboptimal data. Empirical results on the D4RL suite show state-of-the-art performance on locomotion benchmarks and leading results on Antmaze without per-dataset tuning, with additional improvements when BST is applied to one-step methods such as IQL. The work demonstrates the practical benefits of combining uncertainty-based supervision with offline RL and points to future directions in alternative uncertainty measures and multi-source ensembles.
Abstract
Offline reinforcement learning (RL) algorithms are applied to learn performant, well-generalizing policies when provided with a static dataset of interactions. Many recent approaches to offline RL have seen substantial success, but with one key caveat: they demand substantial per-dataset hyperparameter tuning to achieve reported performance, which requires policy rollouts in the environment to evaluate; this can rapidly become cumbersome. Furthermore, substantial tuning requirements can hamper the adoption of these algorithms in practical domains. In this paper, we present TD3 with Behavioral Supervisor Tuning (TD3-BST), an algorithm that trains an uncertainty model and uses it to guide the policy to select actions within the dataset support. TD3-BST can learn more effective policies from offline datasets compared to previous methods and achieves the best performance across challenging benchmarks without requiring per-dataset tuning.
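To make the idea of uncertainty-guided regularization concrete, the following is a minimal NumPy sketch of a behavioral-cloning penalty whose weight shrinks as a certainty score grows. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the RBF form of the certainty score, the `1 - certainty` weighting, and the `q_scale` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_certainty(embedding, target, lam=1.0):
    """Kernel-style certainty score in (0, 1].

    Assumption: certainty is an RBF kernel between a learned embedding and a
    target, so it equals 1 when they coincide and decays with squared distance.
    """
    sq_dist = np.sum((embedding - target) ** 2, axis=-1)
    return np.exp(-0.5 * lam * sq_dist)


def uncertainty_weighted_actor_loss(q_values, policy_actions, dataset_actions,
                                    certainty, q_scale=1.0):
    """Actor loss with an uncertainty-weighted behavioral-cloning term.

    Assumption: the BC weight is (1 - certainty), so actions the supervisor is
    certain about are driven mainly by the critic, while uncertain actions are
    pulled toward the dataset action.
    """
    bc_weight = 1.0 - certainty
    bc_penalty = np.sum((policy_actions - dataset_actions) ** 2, axis=-1)
    return np.mean(-q_scale * q_values + bc_weight * bc_penalty)


# Toy batch: two transitions with three-dimensional actions.
q = np.array([1.0, 2.0])
pi_a = np.ones((2, 3))
data_a = np.zeros((2, 3))

# Fully certain supervisor: the BC term vanishes and only the Q term remains.
loss_certain = uncertainty_weighted_actor_loss(q, pi_a, data_a, np.ones(2))
# Fully uncertain supervisor: the full BC penalty is applied.
loss_uncertain = uncertainty_weighted_actor_loss(q, pi_a, data_a, np.zeros(2))
```

In this sketch the same batch yields a lower loss when the supervisor is certain, which is the qualitative behavior the summary describes: the strength of behavioral cloning adapts per sample instead of being fixed by a single per-dataset coefficient.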
