Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Amos Storkey

TL;DR

The paper reframes offline-to-online reinforcement learning (OtO RL) as an exploration problem, arguing that policy-constraining offline methods can hinder performance when the behavior policy is suboptimal. It introduces PTGOOD, a non-myopic planning-based approach that uses the Conditional Entropy Bottleneck (CEB) to model the behavior policy's occupancy $\rho_{\pi_b}$ and a rate-based criterion to steer online data collection toward out-of-distribution, high-reward regions near the current policy, without altering rewards. PTGOOD builds a planning tree of width $w$ and depth $d$, samples actions perturbed by Gaussian noise $\mathcal{N}(0,\epsilon)$, and selects the action that maximizes the rate accumulated along trajectories predicted by a learned dynamics model $\hat{\mathcal{T}}$. Empirically, PTGOOD yields higher online fine-tuning returns than the baselines in most environment-dataset combinations across diverse continuous-control tasks, and it avoids the suboptimal convergence observed with several baselines. The work demonstrates the value of explicit, non-myopic planning for data collection in OtO RL and suggests adaptive planning and planning-noise strategies as directions for further improving robustness and sample efficiency.
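The planning step summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `plan_action`, the callables `policy`, `dynamics`, and `rate_fn`, and the toy stand-ins in the usage lines are all hypothetical. The sketch also assumes that the tree branches $w$ ways at every one of the $d$ levels and that $\epsilon$ is the noise variance; the paper's exact branching scheme and noise parameterization may differ.

```python
import numpy as np


def plan_action(state, policy, dynamics, rate_fn, width, depth, eps, rng):
    """Return the first action on the highest accumulated-rate path of a sampled tree.

    policy(s) -> mean action, dynamics(s, a) -> predicted next state and
    rate_fn(s, a) -> scalar rate are hypothetical callables standing in for
    the learned policy, the learned dynamics model and the CEB rate.
    """

    def expand(s, remaining):
        # Returns (best accumulated rate reachable from s, first action on that path).
        best_total, best_action = -np.inf, None
        for _ in range(width):
            mean = policy(s)
            # Assumption: eps is a variance, so the standard deviation is sqrt(eps).
            a = mean + rng.normal(0.0, np.sqrt(eps), size=mean.shape)
            total = rate_fn(s, a)
            if remaining > 1:
                # Roll the model forward and add the best rate found deeper in the tree.
                total += expand(dynamics(s, a), remaining - 1)[0]
            if total > best_total:
                best_total, best_action = total, a
        return best_total, best_action

    return expand(state, depth)[1]


# Toy usage with stand-in components (none of these come from the paper).
rng = np.random.default_rng(0)
toy_policy = lambda s: np.zeros(2)                 # always proposes the zero action
toy_dynamics = lambda s, a: s + a                  # additive toy dynamics
toy_rate = lambda s, a: float(np.sum((s + a) ** 2))  # grows away from the origin
action = plan_action(np.zeros(2), toy_policy, toy_dynamics, toy_rate,
                     width=3, depth=2, eps=0.1, rng=rng)
```

The sketch evaluates $w^d$ leaf paths, so it is only practical for small widths and depths. Only the first action of the best path is returned, which matches the idea of replanning at every environment step.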

Abstract

Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data collection. We first study the major online RL exploration methods based on intrinsic rewards and upper confidence bounds (UCB) in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, while UCB methods are myopic and leave it unclear which learned component's ensemble to use for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.

Paper Structure

This paper contains 26 sections, 5 equations, 17 figures, 5 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Undiscounted evaluation returns in Halfcheetah (Random) (left) and DMC Walker (Random) (right) for $\lambda \in \{0, 0.1, 1, 10, 50\}$ intrinsic-reward weights throughout online fine-tuning.
  • Figure 2: Offline (orange) and online (blue) components in OtO RL, with PTGOOD planning shown on the far right. During offline pre-training, dynamics $\hat{\mathcal{T}}$, reward $\hat{R}$, encoder $e$, backward encoder $b$, marginal $m$, and policy $\pi$ (and other agent-related networks, depending on algorithm) are trained with data from $D_{\pi_b}$. During the online data-collection phase, PTGOOD's planner interacts with the environment using $\hat{\mathcal{T}}, e, m, \pi$, and stores data in $D_{\pi_o}$. Interleaved with data collection is fine-tuning, which occurs with data sampled from both $D_{\pi_b}$ and $D_{\pi_o}$. As shown on the right, PTGOOD's planning procedure follows the improving policy $\pi$ from a given $s$ towards increasingly higher reward regions of the $\mathcal{S} \times \mathcal{A}$ space, and targets data in those spaces that are unlikely under $\rho_{\pi_b}$.
  • Figure 3: Average (bold line) $\pm$ one standard deviation (shaded area) of evaluation returns for different $\epsilon$ values in PTGOOD's planner in Halfcheetah (Random) (left) and DMC Walker (Medium Replay) (right).
  • Figure 4: Undiscounted evaluation returns for RND/DeRL hyperparameter tuning.
  • Figure 5: Undiscounted evaluation returns for UCB(Q) hyperparameter tuning.
  • ...and 12 more figures