When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning

Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, Xianyuan Zhan

TL;DR

This work tackles the practical challenge of deploying RL when simulators have imperfect dynamics and offline data are limited in coverage. It introduces H2O, a dynamics-aware hybrid offline-and-online RL framework that jointly leverages a fixed real-world offline dataset and online simulation rollouts, using a KL-regularized, dynamics-gap-driven sampling distribution $d^\phi(\mathbf{s}, \mathbf{a})$ and an importance-weighted Bellman error to adapt learning. The theoretical analysis shows that the framework acts as an adaptive reward adjustment that conservatively lowers value estimates in high-gap regions while boosting them in low-gap regions, enabling safer and more effective policy learning across domains. Empirically, H2O outperforms purely online, offline, and hybrid baselines in MuJoCo HalfCheetah variants with induced dynamics gaps and in real-world experiments on a wheel-legged robot, demonstrating improved sim-to-real transfer and robustness to data limitations. The results suggest that dynamics-aware hybrid learning can guide future RL algorithm design for real-world tasks where simulators are imperfect and real data are scarce.

Abstract

Learning effective reinforcement learning (RL) policies to solve complex real-world tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility: learning policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline datasets with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes Q-function learning on simulated state-action pairs with large dynamics gaps, while simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.
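
To make the policy evaluation idea above concrete, here is a minimal pure-Python sketch of a hybrid Q-loss. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gap_weights`, `weighted_logsumexp`, `hybrid_q_loss`), the weighted log-sum-exp form of the penalty, and the `beta` trade-off parameter are all hypothetical choices made for this example. The dynamics-gap estimates, importance ratios, and TD errors are taken as given inputs.

```python
import math


def gap_weights(gaps):
    """Turn non-negative dynamics-gap estimates into normalized weights."""
    total = sum(gaps)
    if total <= 0:  # no measurable gap anywhere: fall back to uniform weights
        return [1.0 / len(gaps)] * len(gaps)
    return [g / total for g in gaps]


def weighted_logsumexp(q_values, weights):
    """Numerically stable log(sum_i w_i * exp(q_i))."""
    m = max(q_values)
    return m + math.log(sum(w * math.exp(q - m) for w, q in zip(weights, q_values)))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def hybrid_q_loss(q_sim, gaps, q_real, td_sim, td_real, is_ratio, beta=1.0):
    """Toy hybrid policy-evaluation loss (assumed form, for illustration only).

    q_sim, gaps : Q-values and dynamics-gap estimates on simulated (s, a) pairs
    q_real      : Q-values on (s, a) pairs from the real offline dataset
    td_sim/real : TD errors on simulated / real transitions
    is_ratio    : assumed importance ratios correcting simulated transitions
    """
    # Conservative term: push Q down on simulated pairs, with large-gap pairs
    # weighted more heavily, and push Q up on real-data pairs.
    penalty = weighted_logsumexp(q_sim, gap_weights(gaps)) - mean(q_real)
    # Bellman error: real transitions are used as-is, simulated transitions
    # are reweighted by the importance ratio.
    bellman_real = mean(e * e for e in td_real)
    bellman_sim = mean(w * e * e for w, e in zip(is_ratio, td_sim))
    return beta * penalty + 0.5 * (bellman_real + bellman_sim)
```

In this sketch, a simulated pair with a large gap estimate contributes more to the penalty when its Q-value is high, which is one simple way to express "trust the simulator less where its dynamics disagree with the real data".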

Paper Structure

This paper contains 36 sections, 4 theorems, 36 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

(A sharpened version of Jensen's inequality [liao2018sharpening].) Let $X$ be a one-dimensional random variable with $P(X\in (a,b))=1$, where $-\infty \leq a < b \leq \infty$. Let $\varphi(x)$ be a twice-differentiable function on $(a,b)$. Then:
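
The inequality that completes the lemma statement does not appear in this extract. Assuming the lemma takes the standard second-derivative form of the sharpened Jensen bound, it would read as follows, with $\mathrm{Var}(X)$ denoting the variance of $X$:

$$
\inf_{x\in(a,b)} \frac{\varphi''(x)}{2}\,\mathrm{Var}(X) \;\leq\; \mathbb{E}[\varphi(X)] - \varphi(\mathbb{E}[X]) \;\leq\; \sup_{x\in(a,b)} \frac{\varphi''(x)}{2}\,\mathrm{Var}(X)
$$

As a sanity check on this form: for a convex $\varphi$, $\varphi''(x) \geq 0$ on $(a,b)$, so the lower bound is non-negative and the ordinary Jensen inequality $\mathbb{E}[\varphi(X)] \geq \varphi(\mathbb{E}[X])$ is recovered.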

Figures (7)

  • Figure 1: Conceptual illustration of the dynamics-aware hybrid offline-and-online RL framework
  • Figure 2: Real-world validation on a wheel-legged robot
  • Figure 3: The dynamics gap measure $u(\mathbf{s}, \mathbf{a})$ evaluated during the training process
  • Figure 4: Single-step reward distribution in human-collected datasets of Standing Still and Moving Straight tasks
  • Figure 5: Cumulative rewards of different baselines recorded in real-world validation
  • ...and 2 more figures

Theorems & Definitions (7)

  • Lemma 1
  • Corollary 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof