The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning

Anya Sims, Cong Lu, Jakob Foerster, Yee Whye Teh

TL;DR

The paper identifies a fundamental edge-of-reach problem in offline model-based reinforcement learning: truncating model rollouts creates states that appear only as the final state of a rollout, so they are used as bootstrap targets but their own values are never updated. This amounts to “bootstrapping from the void” and causes catastrophic $Q$-value overestimation. Through theoretical formalization and empirical evidence on D4RL and simple toy environments, the authors show that improving the learned dynamics model alone can cause existing model-based methods to fail, and that these methods fail completely when given the true, error-free dynamics. They propose Reach-Aware Value Learning (RAVL), which replaces dynamics uncertainty penalties with value pessimism via a $Q$-ensemble to detect edge-of-reach states and downweight their impact. RAVL performs robustly across learned-dynamics and true-dynamics settings, including pixel-based variants, and points to a practical path toward “future-proof” offline RL by addressing the edge-of-reach problem directly rather than relying on dynamics accuracy. The work also connects model-based and model-free perspectives, offering a unified view and guidance for future offline RL design, with open-source code available at github.com/anyasims/edge-of-reach.

Abstract

Offline reinforcement learning aims to train agents from pre-collected datasets. However, this comes with the added challenge of estimating the value of behaviors not covered in the dataset. Model-based methods offer a potential solution by training an approximate dynamics model, which then allows collection of additional synthetic data via rollouts in this model. The prevailing theory treats this approach as online RL in an approximate dynamics model, and any remaining performance gap is therefore understood as being due to dynamics model errors. In this paper, we analyze this assumption and investigate how popular algorithms perform as the learned dynamics model is improved. In contrast to both intuition and theory, if the learned dynamics model is replaced by the true error-free dynamics, existing model-based methods completely fail. This reveals a key oversight: the theoretical foundations assume sampling of full-horizon rollouts in the learned dynamics model; however, in practice, the number of model-rollout steps is aggressively reduced to prevent accumulating errors. We show that this truncation of rollouts results in a set of edge-of-reach states at which we are effectively “bootstrapping from the void.” This triggers pathological value overestimation and complete performance collapse. We term this the edge-of-reach problem. Based on this new insight, we fill important gaps in existing theory, and reveal how prior model-based methods are primarily addressing the edge-of-reach problem, rather than model inaccuracy as claimed. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and hence, unlike existing methods, does not fail as the dynamics model is improved. Code open-sourced at: github.com/anyasims/edge-of-reach.
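
To illustrate why value pessimism over a $Q$-ensemble can single out edge-of-reach states, here is a minimal sketch. It assumes that pessimism is applied by bootstrapping from the minimum over ensemble members, which is one common choice and not necessarily the exact form used in RAVL. The function names and the example numbers are hypothetical. The idea is that ensemble members agree at states whose values are regularly updated, but disagree at states that are never updated, so the pessimistic target is pulled down far more at the latter.

```python
def bellman_target(reward, gamma, next_q_estimates, pessimistic=True):
    """One-step target r + gamma * agg_i Q_i(s', a') over an ensemble.

    Assumption: pessimism is implemented as a minimum over the ensemble;
    the non-pessimistic baseline uses the ensemble mean instead.
    """
    if pessimistic:
        bootstrap = min(next_q_estimates)
    else:
        bootstrap = sum(next_q_estimates) / len(next_q_estimates)
    return reward + gamma * bootstrap


def pessimism_gap(reward, gamma, next_q_estimates):
    """How much the pessimistic target is lowered relative to the mean target."""
    mean_target = bellman_target(reward, gamma, next_q_estimates, pessimistic=False)
    min_target = bellman_target(reward, gamma, next_q_estimates, pessimistic=True)
    return mean_target - min_target


# Hypothetical ensemble estimates at the next state s'.
within_reach_q = [10.0, 10.1, 9.9]     # regularly updated: members agree
edge_of_reach_q = [25.0, -3.0, 12.0]   # never updated: members disagree

gap_within = pessimism_gap(1.0, 0.99, within_reach_q)
gap_edge = pessimism_gap(1.0, 0.99, edge_of_reach_q)
print(f"penalty within reach: {gap_within:.3f}")
print(f"penalty at edge of reach: {gap_edge:.3f}")
```

In this toy comparison the effective penalty is small where the ensemble agrees and large where it does not, which is the qualitative behavior the paper reports for edge-of-reach versus within-reach states.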

Paper Structure (43 sections, 4 equations, 7 figures, 12 tables, 1 algorithm)

Figures (7)

  • Figure 1: Existing offline model-based RL methods fail if the accuracy of the dynamics model is increased (with all else kept the same). Results shown are for MOPO \cite{mopo}, but note that this failure indicates the failure of all existing uncertainty-based methods, since each of their specific penalty terms disappears under the true dynamics, where the 'uncertainty' is zero. By contrast, our method is much more robust to changes in the dynamics model. The $x$-axis shows linear interpolation of the next states and rewards of the learned model with those of the true model (center$\rightarrow$right) and of a random model (center$\rightarrow$left), with results on the D4RL Walker2d-medexp benchmark (min/max over 4 seeds). The full set of results and the experimental setup are provided in \ref{tab:demo_failure} and \ref{sec:apdx_hparams} respectively.
  • Figure 2: The previously unnoticed edge-of-reach problem. Left illustrates the base procedure used in offline model-based RL, whereby synthetic data is sampled as $k$-step trajectories ("rollouts") starting from a state in the original offline dataset. Edge-of-reach states are those that can be reached in $k$ steps, but which cannot (under any policy) be reached in fewer than $k$ steps. We depict the data collected with two rollouts, one ending in $s_k=D$, and the other in $s_k=C$. Right then shows this data arranged into a dataset of transitions as used in $Q$-updates. State $D$ is edge-of-reach and hence appears in the dataset as $s'$ but never as $s$. Bellman updates therefore bootstrap from $D$, but never update the value at $D$ (see \ref{eqn:pol_eval}). (For comparison, consider state $C$: $C$ is also sampled at $s_k$, but unlike $D$ it is not edge-of-reach, and hence is also sampled at $s_{i<k}$, meaning it is updated and does not cause issues.)
  • Figure 3: Experiments on the simple environment, illustrating the edge-of-reach problem and potential solutions. (a) Reward function, (b) final (failed) policy with naïve application of the base procedure (see \ref{alg:combined}), (c) final (successful) policy when oracle $Q$-values are patched in for edge-of-reach states, (d) final (successful) policy with RAVL, (e) returns evaluated over training, (f) mean $Q$-values evaluated over training.
  • Figure 4: RAVL's effective penalty of $Q$-ensemble variance on the environment in \ref{sec:toyenv}, showing that, as intended, edge-of-reach states have a significantly higher penalty than within-reach states.
  • Figure 5: We find that the dynamics uncertainty-based penalty used in MOPO \cite{mopo} is positively correlated with the variance of the value ensemble of RAVL, suggesting prior methods may unintentionally address the edge-of-reach problem. Pearson correlation coefficients are 0.49, 0.43, and 0.27 for Hopper-mixed, Walker2d-medexp, and Halfcheetah-medium respectively.
  • ...and 2 more figures
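
The definition in the Figure 2 caption (states that appear in the transition dataset as $s'$ but never as $s$) can be made concrete with a small sketch. The chain environment, the function names, and the choice to enumerate every action sequence (standing in for "under any policy") are illustrative assumptions, not part of the paper's codebase.

```python
from itertools import product


def rollout_transitions(start_states, actions, step, k):
    """Collect (s, a, s') from every k-step rollout out of each start state.

    Enumerating all action sequences of length k covers every state that
    any policy could reach in at most k steps in a deterministic model.
    """
    transitions = set()
    for s0 in start_states:
        for plan in product(actions, repeat=k):
            s = s0
            for a in plan:
                s_next = step(s, a)
                transitions.add((s, a, s_next))
                s = s_next
    return transitions


def edge_of_reach(transitions):
    """States used as bootstrap targets (s') whose values are never updated (never s)."""
    updated = {s for s, _, _ in transitions}
    bootstrapped = {s_next for _, _, s_next in transitions}
    return bootstrapped - updated


def chain_step(s, a):
    """Hypothetical deterministic chain with states 0..10 and actions -1 / +1."""
    return max(0, min(10, s + a))


# Offline dataset states {0, 1}, rollout length k = 3.
transitions = rollout_transitions([0, 1], [-1, +1], chain_step, k=3)
eor = edge_of_reach(transitions)
print(eor)  # {4}: reachable only at the final step of a rollout from state 1
```

State 4 is reached only on the last step of the rollout 1, 2, 3, 4, so it enters $Q$-updates as a target but is never itself updated. State 3, by contrast, is also reached at step 2 and therefore does appear as $s$.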

Theorems & Definitions (1)

  • proof