Benchmarks for Reinforcement Learning with Biased Offline Data and Imperfect Simulators

Ori Linial, Guy Tennenholtz, Uri Shalit

TL;DR

This work tackles the challenge of training RL agents when real-world exploration is limited by cost or safety concerns, by combining offline data with imperfect simulators. It introduces Benchmarks for Mechanistic Offline Reinforcement Learning (B4MRL), a suite that systematically probes four core challenges (simulator modeling error, partial observability, state and action discrepancies, and hidden confounding) through simulator perturbations and dataset alterations built on MuJoCo and Highway tasks with D4RL data. The study evaluates online, offline, and hybrid RL methods (including MOPO, TD3-BC, IQL, SAC, TD3, H2O, and HyMOPO) and finds that hybrid approaches do not always outperform the best of online or offline methods, especially when hidden confounding or severe observability gaps are present. These findings underscore the need for robust hybrid RL algorithms and suggest that the B4MRL benchmarks can support progress toward more reliable, safe, and transferable RL systems in real-world settings.

Abstract

In many reinforcement learning (RL) applications one cannot easily let the agent act in the world; this is true for autonomous vehicles, healthcare applications, and even some recommender systems, to name a few examples. Offline RL provides a way to train agents without real-world exploration, but often suffers from biases due to data distribution shifts, limited coverage, and incomplete representation of the environment. To address these issues, practical applications have tried to combine simulators with grounded offline data, using so-called hybrid methods. However, constructing a reliable simulator is itself often challenging, owing to intricate system complexities as well as missing or incomplete information. In this work, we outline four principal challenges for combining offline data with imperfect simulators in RL: simulator modeling error, partial observability, state and action discrepancies, and hidden confounding. To help drive the RL community to pursue these problems, we construct "Benchmarks for Mechanistic Offline Reinforcement Learning" (B4MRL), which provide dataset-simulator benchmarks for the aforementioned challenges. Our results suggest that such benchmarks are necessary for future research.

Paper Structure (24 sections, 5 figures, 8 tables, 1 algorithm)

This paper contains 24 sections, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: An illustration of the discrepancies and biases arising when training RL agents. Modeling error refers to the discrepancy between the real-world dynamics and the simulator, e.g., transition error. Confounding error refers to bias due to the dataset not including factors affecting the behavioral policy. Other challenges include limited exploration, partial observability, and state and action discrepancies, as detailed in the challenges section.
  • Figure 2: Both figures represent the causal graph of a POMDP. In both cases the state $s$ is not observed, but only in figure (a) does $s$ act as a confounder, because the actions in the data were taken with respect to the unobserved $s$.
  • Figure 3: Results on the HalfCheetah environment for modeling error and partial observability. In both figures, the algorithms have access to the standard D4RL datasets, but use different types of imperfect simulators. For modeling error (a), we introduced an error in the transition function by setting the gravitational parameter to $g=19.6$ instead of $9.81$; for partial observability (b), we added Gaussian noise ($\sigma=0.05$) to the full state.
  • Figure 4: Results of hybrid algorithms on medium (M), medium-replay (MR) and medium-expert (ME) datasets, under three different challenges. The leftmost column shows results when the simulator has an error in the gravitational parameter ($g$). The middle column shows results when the simulator has either high observational noise ($\sigma_\text{high}$) or is missing an important variable ($h_\text{high}$). Both algorithms perform well on challenges 1 and 2, suggesting that they were able to bypass errors in the transition function and in the observational function of the simulator. The rightmost column (challenges 4 & 1) shows results when the data is confounded, through either high observational noise or a missing important variable, and the simulator has an error in the gravitational parameter. Dashed lines connect experiments done using the same algorithm and the same type of dataset (e.g., HyMOPO with the medium dataset).
  • Figure 5: Results of offline (TD3-BC) and online (SAC) algorithms on the HalfCheetah environment with a single missing variable. TD3-BC runs on the medium-expert dataset. For each label on the x-axis, SAC was trained on a partially observed simulator that lacks that variable, and TD3-BC was trained on a dataset that contained no information about that variable, even though the agent that generated the dataset used it.
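
The perturbations named in the captions of Figures 3 and 5 (a changed gravitational parameter, Gaussian observation noise, and a missing state variable) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the B4MRL implementation: states are plain Python lists rather than MuJoCo states, and the function names and the `params` dictionary are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def add_observation_noise(state: Sequence[float], sigma: float,
                          rng: random.Random) -> List[float]:
    """Return a copy of `state` with i.i.d. Gaussian noise N(0, sigma^2) added."""
    return [x + rng.gauss(0.0, sigma) for x in state]


def drop_variable(state: Sequence[float], index: int) -> List[float]:
    """Return `state` without the entry at `index` (a missing variable)."""
    return [x for i, x in enumerate(state) if i != index]


def with_gravity(params: Dict[str, float], g: float) -> Dict[str, float]:
    """Return a copy of the (hypothetical) simulator parameters with gravity set to `g`."""
    perturbed = dict(params)  # copy so the original parameters stay untouched
    perturbed["gravity"] = g
    return perturbed


if __name__ == "__main__":
    rng = random.Random(0)         # fixed seed for reproducibility
    true_state = [0.1, -0.3, 1.2]  # made-up values, for illustration only
    print(add_observation_noise(true_state, sigma=0.05, rng=rng))
    print(drop_variable(true_state, index=1))
    print(with_gravity({"gravity": 9.81}, g=19.6))
```

In a real setup these transformations would presumably be applied inside the simulator or as a wrapper around its observation function. The sketch only fixes the semantics of each perturbation: noise and variable removal alter what the agent observes, while the gravity change alters the transition dynamics.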