Hybrid Cross-domain Robust Reinforcement Learning

Linh Le Pham Van, Minh Hoang Nguyen, Hung Le, Hung The Tran, Sunil Gupta

TL;DR

This work tackles offline reinforcement learning under distributional robustness when training data are scarce and dynamics differ between offline targets and online sources. It introduces HYDRO, a Hybrid Cross-domain Robust RL framework that leverages an online source simulator to augment limited offline target data, while using uncertainty-based filtering and priority sampling to minimize the impact of dynamics mismatch. The authors provide a theoretical framework, including a convergence guarantee and a performance bound that quantifies domain gaps, and demonstrate that HYDRO achieves superior robustness and data efficiency across MuJoCo tasks compared to strong baselines. The approach offers practical gains for real-world applications where collecting exhaustive offline data is costly, and dynamics mismatch between simulators and targets is prevalent.

Abstract

Robust reinforcement learning (RL) aims to learn policies that remain effective despite uncertainties in the environment, which frequently arise in real-world applications due to variations in environment dynamics. Robust RL methods learn a robust policy by maximizing value under the worst-case models within a predefined uncertainty set. Offline robust RL algorithms are particularly promising in scenarios where only a fixed dataset is available and new data cannot be collected. However, these approaches often require extensive offline data, and gathering such datasets for specific tasks in specific environments can be both costly and time-consuming. Using an imperfect simulator offers a faster, cheaper, and safer way to collect training data, but the simulator's dynamics may not match those of the target environment. In this paper, we introduce HYDRO, the first Hybrid Cross-Domain Robust RL framework designed to address these challenges. HYDRO utilizes an online simulator to complement limited offline data in the non-trivial context of robust RL. By measuring and minimizing performance gaps between the simulator and the worst-case models in the uncertainty set, HYDRO employs novel uncertainty filtering and prioritized sampling to select the most relevant and reliable simulator samples. Our extensive experiments demonstrate HYDRO's superior performance over existing methods across various tasks, underscoring its potential to improve sample efficiency in offline robust RL.
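The selection step described above (filter out unreliable simulator samples, then sample the remainder by priority) can be illustrated with a minimal Python sketch. This is not HYDRO's actual procedure: the summary does not specify how the uncertainty or priority scores are computed, so `uncertainty` and `priority` are hypothetical precomputed per-sample scores, and the quantile threshold `keep_quantile` and the shift-and-normalize weighting are assumptions made only for illustration.

```python
import numpy as np


def select_source_batch(uncertainty, priority, batch_size, keep_quantile=0.8, rng=None):
    """Pick indices of simulator (source) samples for one training batch.

    Assumed inputs (hypothetical, not taken from the paper):
      uncertainty: per-sample score, higher means less reliable
      priority:    per-sample score, higher means more relevant
    """
    rng = np.random.default_rng() if rng is None else rng
    uncertainty = np.asarray(uncertainty, dtype=float)
    priority = np.asarray(priority, dtype=float)

    # Uncertainty filtering: drop samples whose score exceeds the chosen quantile.
    threshold = np.quantile(uncertainty, keep_quantile)
    kept = np.flatnonzero(uncertainty <= threshold)

    # Priority sampling: draw kept samples with probability proportional to
    # their shifted priority (the shift keeps all weights strictly positive).
    weights = priority[kept] - priority[kept].min() + 1e-8
    probs = weights / weights.sum()
    return rng.choice(kept, size=batch_size, replace=True, p=probs)


if __name__ == "__main__":
    idx = select_source_batch(
        uncertainty=[0.1, 0.2, 0.9, 0.3, 5.0],
        priority=[1.0, 2.0, 3.0, 4.0, 5.0],
        batch_size=8,
        keep_quantile=0.6,
        rng=np.random.default_rng(0),
    )
    print(idx)  # only indices 0, 1 and 3 survive the filter
```

Filtering before sampling means a sample with a high priority but a high uncertainty score can never enter the batch, which is one simple way to combine the two criteria.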

Paper Structure

This paper contains 22 sections, 4 theorems, 33 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem (Performance Bound)

Let $\mathcal{M}_{src}$ and $\mathcal{M}_{r}$ be the source MDP and the target RMDP with different dynamics $P_{src}$ and $P^o$ respectively. Consider the RMDP with the TV uncertainty set. Denote: where, given a policy $\pi$, $P^{\pi, \mathcal{U}^\sigma_{TV}(P^o)}$, $P^{\pi, \mathcal{U}^\sigma_{TV}(\hat{P}^o)}$ denote the worst case model w.r.t. the uncertainty set around the target model $P^o$ a
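To make the TV uncertainty set in the statement above concrete, here is a minimal tabular sketch of the worst-case expected value over a total-variation ball around a nominal next-state distribution. It assumes the common convention $D_{TV}(P, Q) = \tfrac{1}{2}\lVert P - Q \rVert_1 \le \sigma$ and a finite state space; the function name and the toy numbers are illustrative and are not taken from the paper.

```python
import numpy as np


def worst_case_expectation_tv(p_nominal, values, sigma):
    """Minimise E_{s' ~ P}[V(s')] over {P : TV(P, p_nominal) <= sigma}.

    Assumes TV(P, Q) = 0.5 * ||P - Q||_1 and a finite set of next states.
    Returns the worst-case expectation and the minimising distribution.
    """
    p = np.asarray(p_nominal, dtype=float).copy()
    v = np.asarray(values, dtype=float)
    worst = int(np.argmin(v))  # the adversary piles mass onto the lowest-value state
    budget = float(sigma)      # total probability mass the adversary may move

    # Greedily take mass from the highest-value states first.
    for s in np.argsort(-v):
        if s == worst or budget <= 0.0:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[worst] += moved
        budget -= moved
    return float(p @ v), p


if __name__ == "__main__":
    value, p_worst = worst_case_expectation_tv([0.5, 0.3, 0.2], [1.0, 2.0, 3.0], sigma=0.25)
    print(value, p_worst)
```

With the nominal distribution `[0.5, 0.3, 0.2]` and values `[1, 2, 3]`, the nominal expectation is 1.7; with `sigma=0.25` the adversary moves 0.2 of mass from the best state and 0.05 from the second best onto the worst one, lowering the expectation to 1.25.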

Figures (4)

  • Figure 1: A limitation of existing offline robust RL models: robust performance drops significantly as the amount of training data decreases. The figure compares the performance of offline robust RL (Panaganti et al., 2022) under 'front_joint_stiffness' perturbation for different training data sizes drawn from the HalfCheetah medium dataset (D4RL).
  • Figure 2: Cumulative rewards of different methods on three MuJoCo benchmarks under perturbation. The lines are the average returns over 30 different seeded runs, and the shaded areas represent the standard deviation.
  • Figure 3: (a) Robust performance comparison between HYDRO, RFQI, and its variations using naive combination of source and target data. (b-c) Robust performance comparison between HYDRO and its variants without priority sampling and uncertainty filter.
  • Figure 4: Average priority scores of random and priority sampling.

Theorems & Definitions (4)

  • Theorem: Performance Bound
  • Theorem: Convergence
  • Lemma
  • Theorem: Performance Bound