COSBO: Conservative Offline Simulation-Based Policy Optimization

Eshagh Kargar, Ville Kyrki

TL;DR

The paper tackles offline reinforcement learning under data constraints and sim-to-real gaps by proposing COSBO, a simulation-enhanced offline RL method that does not learn a transition model. COSBO uses simulator rollouts with near-target dynamics to conservatively update the Q-function, creating a tighter lower bound on the true value and enabling better policy improvement from mixed offline and simulated data. Empirical results on MuJoCo Hopper/Walker2d and D4RL datasets show COSBO consistently outperforms state-of-the-art baselines such as CQL, MOPO, and COMBO, while maintaining robustness under substantial dynamics mismatch. This approach demonstrates the practical viability of leveraging simulation data to augment offline RL in scenarios where real-world interaction is limited.
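To make the phrase "conservatively update the Q-function" concrete, here is a minimal sketch of a COMBO-style conservative loss, in which simulator rollouts take the place of learned-model rollouts. This is an assumption about the general shape of such an objective, not the paper's exact formulation: the function name `conservative_q_loss`, the weight `beta`, and the way batches are passed in are all hypothetical.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(xs) / len(xs)


def conservative_q_loss(q_sim, q_data, q_mix, targets_mix, beta=1.0):
    """Illustrative COMBO-style conservative loss (not the paper's exact objective).

    q_sim:       Q(s, a) on state-action pairs from simulator rollouts
    q_data:      Q(s, a) on state-action pairs from the offline dataset
    q_mix:       Q(s, a) on a mixed batch of dataset and simulator samples
    targets_mix: Bellman targets for the same mixed batch
    beta:        hypothetical weight on the conservative penalty
    """
    # Minimizing this term lowers Q on simulator samples and raises it on
    # dataset samples, which is what makes the value estimate conservative.
    penalty = mean(q_sim) - mean(q_data)
    # Standard squared Bellman error on the mixed batch.
    bellman = 0.5 * mean([(q - y) ** 2 for q, y in zip(q_mix, targets_mix)])
    return beta * penalty + bellman


loss = conservative_q_loss(
    q_sim=[2.0, 4.0],
    q_data=[1.0, 1.0],
    q_mix=[1.0, 3.0],
    targets_mix=[1.0, 1.0],
)
```

In a real training loop the Q-values would come from a neural network and the loss would be minimized by gradient descent; the plain-list version above only shows how the two terms are combined.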

Abstract

Offline reinforcement learning allows training reinforcement learning models on data from live deployments. However, it is limited to choosing the best combination of behaviors present in the training data. In contrast, simulation environments that attempt to replicate the live environment can be used instead of the live data, yet this approach is limited by the simulation-to-reality gap, which introduces bias. In an attempt to get the best of both worlds, we propose a method that combines an imperfect simulation environment with data from the target environment to train an offline reinforcement learning policy. Our experiments demonstrate that the proposed method outperforms the state-of-the-art approaches CQL, MOPO, and COMBO, especially in scenarios with diverse and challenging dynamics, and behaves robustly across a variety of experimental conditions. The results highlight that simulator-generated data can effectively enhance offline policy learning despite the sim-to-real gap when direct interaction with the real world is not possible.

Paper Structure (14 sections, 5 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: The proposed COSBO framework.
  • Figure 2: Comparison to baselines in the Hopper and the Walker2d environments while using the D4RL dataset only.
  • Figure 3: Comparison to baselines in the Hopper and the Walker2d environments while using the D4RL+simulation data (medium change).
  • Figure 4: Comparison to baselines in the Hopper environment with varying dynamics. (a) Using D4RL+simulation data (very change). (b) Using D4RL+simulation data (extreme change).
  • Figure 5: Comparison to baselines in the Walker2d environment with varying dynamics. (a) Using D4RL+simulation data (very change). (b) Using D4RL+simulation data (extreme change).