Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics

Ming-Hong Chen; Kuan-Chen Pan; You-De Huang; Xi Liu; Ping-Chun Hsieh

Abstract

Cross-domain reinforcement learning (CDRL) aims to improve the data efficiency of RL by leveraging data samples collected in a source domain to facilitate learning in a similar target domain. Despite its potential, cross-domain transfer in RL faces two fundamental and intertwined challenges: (i) the source and target domains can have distinct state or action spaces, which makes direct transfer infeasible and requires more sophisticated inter-domain mappings; (ii) the transferability of a source-domain model is not easily identifiable a priori, so CDRL can be prone to negative transfer. In this paper, we propose to jointly tackle these two challenges through the lens of cross-domain Bellman consistency and hybrid critics. Specifically, we first introduce the notion of cross-domain Bellman consistency as a way to measure the transferability of a source-domain model. Then, we propose $Q$Avatar, which combines the Q functions from both the source and target domains with an adaptive, hyperparameter-free weight function. Through this design, we characterize the convergence behavior of $Q$Avatar and show that it achieves reliable transfer in the sense that it effectively leverages a source-domain Q function for knowledge transfer to the target domain. Through experiments, we demonstrate that $Q$Avatar achieves favorable transferability across various RL benchmark tasks, including locomotion and robot arm manipulation. Our code is available at https://rl-bandits-lab.github.io/Cross-Domain-RL/.
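To make the hybrid-critic idea concrete, the following is a minimal tabular sketch, not the paper's implementation. It assumes the source Q function has already been evaluated on target state-action pairs through some inter-domain mapping (the table `q_src_mapped`), and it assumes a weight formed from the ratio of two Bellman errors measured on target-domain transitions. The function names, the max-based backup, and the specific weighting rule are all hypothetical choices made for illustration.

```python
import numpy as np

def bellman_error(q, s, a, r, s_next, gamma=0.99):
    """Mean squared one-step TD error of a tabular Q-table on a batch.

    Assumption: a greedy (max) backup is used here for simplicity.
    q has shape (num_states, num_actions); s, a, s_next are integer arrays.
    """
    target = r + gamma * q[s_next].max(axis=1)
    return float(np.mean((target - q[s, a]) ** 2))

def hybrid_q(q_tar, q_src_mapped, err_tar, err_src, eps=1e-8):
    """Blend the target critic with the mapped source critic.

    Hypothetical rule: the source critic gets more weight when its Bellman
    error on target data is small relative to the target critic's error.
    No tunable hyperparameter is involved; eps only guards against 0/0.
    """
    w_src = err_tar / (err_tar + err_src + eps)
    return w_src * q_src_mapped + (1.0 - w_src) * q_tar, w_src

# Tiny usage example with made-up numbers.
q_tar = np.array([[0.0, 1.0], [2.0, 3.0]])
q_src_mapped = np.array([[1.0, 1.0], [1.0, 1.0]])
s, a = np.array([0]), np.array([1])
r, s_next = np.array([1.0]), np.array([1])

err_tar = bellman_error(q_tar, s, a, r, s_next, gamma=0.5)
err_src = bellman_error(q_src_mapped, s, a, r, s_next, gamma=0.5)
q_mix, w_src = hybrid_q(q_tar, q_src_mapped, err_tar, err_src)
```

Under this rule, a source critic that is badly inconsistent with target-domain transitions receives a weight near zero, so the blend falls back to the target critic.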

Paper Structure

This paper contains 47 sections, 15 theorems, 63 equations, 15 figures, 4 tables, 3 algorithms.

Introduction
Related Work
Preliminaries
Inter-Domain Mapping Functions
Tabular Approximate Q-Natural Policy Gradient
Methodology
Cross-Domain Bellman Consistency
The QAvatar Algorithm
Theoretical Justification of QAvatar
Practical Implementation of QAvatar
Experiments
Setup
Experimental Results
Concluding Remarks
Supporting Lemmas
...and 32 more sections

Key Result

Proposition 1

Under the tabular and approximate-Q settings and Assumption ass:mu, the average sub-optimality of Q-NPG over $T$ iterations is upper bounded by a quantity that depends on the constants $C_0 := \frac{2C_{\pi^*}}{1-\gamma}$ and $C_1 := \frac{2C_{\pi^*}}{(1-\gamma)^3 \mu_{\text{tar,min}}}$.
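For orientation, the kind of iteration this proposition concerns can be sketched as follows. This is a minimal illustration assuming the standard softmax form of a tabular natural policy gradient step, in which each action probability is multiplied by the exponential of a scaled approximate Q-value and then renormalized. The function name `q_npg_step` and the step size `eta` are hypothetical, and the paper's exact step-size scaling may differ.

```python
import numpy as np

def q_npg_step(policy, q_hat, eta):
    """One tabular update: new pi(a|s) proportional to pi(a|s) * exp(eta * q_hat(s, a)).

    policy: array of shape (num_states, num_actions), strictly positive rows
            that each sum to 1 (assumed, so that the log is finite).
    q_hat:  approximate Q-values with the same shape.
    """
    logits = np.log(policy) + eta * q_hat
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# One state, two actions, uniform start; q_hat favors the second action.
policy = np.array([[0.5, 0.5]])
q_hat = np.array([[0.0, np.log(2.0)]])
updated = q_npg_step(policy, q_hat, eta=1.0)
```

Repeating this step with fresh Q estimates yields the sequence of policies whose average sub-optimality the proposition bounds.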

Figures (15)

Figure 1: Training curves of $Q$Avatar and benchmark methods: (a)-(b) Locomotion tasks; (d)-(e) Robot arm manipulation tasks in Robosuite; (f) Navigation task from CarGoal0 to DoggoGoal0.
Figure 1: Time to threshold of $Q$Avatar and SAC.
Figure 2: Aggregated IQMs (with 95% stratified bootstrap CIs) across tasks.
Figure 3: The training curve and the values of $\alpha(t)$ for $Q$Avatar under strongly positive and strongly negative transfer scenarios.
Figure 4: The training curve and the values of $\alpha(t)$ in the Cheetah environment with a low-quality source model.
...and 10 more figures

Theorems & Definitions (32)

Definition 1: Coverage
Definition 2: TD Error
Proposition 1
Definition 3: Cross-Domain Bellman Error
Proposition 2
Definition 4: Cross-Domain Bellman Consistency
Remark 1
Definition 5: Cross-Domain Action Value Function
Proposition 3
Lemma 1: Performance difference lemma
...and 22 more
