Robot Policy Learning with Temporal Optimal Transport Reward

Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, Benoit Boulet

TL;DR

The Temporal Optimal Transport (TemporalOT) reward incorporates temporal order information to learn a more accurate OT-based proxy reward.

Abstract

Reward specification is one of the trickiest problems in Reinforcement Learning and usually requires tedious hand engineering in practice. One promising approach to this challenge is to use existing expert video demonstrations for policy learning. Some recent work investigates how to learn robot policies from only a single or a few expert video demonstrations. For example, reward labeling via Optimal Transport (OT) has been shown to be an effective strategy for generating a proxy reward by measuring the alignment between the robot trajectory and the expert demonstrations. However, previous work mostly overlooks that the OT reward is invariant to temporal order information, which could add extra noise to the reward signal. To address this issue, we introduce the Temporal Optimal Transport (TemporalOT) reward, which incorporates temporal order information to learn a more accurate OT-based proxy reward. Extensive experiments on the Meta-world benchmark tasks validate the efficacy of the proposed method. Code is available at: https://github.com/fuyw/TemporalOT
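
As a rough illustration of the OT reward labeling mentioned in the abstract, the sketch below computes a per-step proxy reward from an entropy-regularized transport plan between agent and expert frame features. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`cosine_cost`, `sinkhorn`, `ot_reward`), the cosine-distance cost, the uniform marginals, and the values of `eps` and `n_iters` are all illustrative choices.

```python
import numpy as np

def cosine_cost(agent, expert):
    """Pairwise cosine distance between agent and expert feature rows."""
    a = agent / np.linalg.norm(agent, axis=1, keepdims=True)
    e = expert / np.linalg.norm(expert, axis=1, keepdims=True)
    return 1.0 - a @ e.T

def sinkhorn(cost, eps=0.05, n_iters=200):
    """Entropy-regularized OT plan with uniform marginals (assumed)."""
    n, m = cost.shape
    mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)              # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):             # alternate row/column scaling
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_reward(agent, expert):
    """Per-step proxy reward: negative transport cost paid by each agent step."""
    cost = cosine_cost(agent, expert)
    plan = sinkhorn(cost)
    return -(plan * cost).sum(axis=1)
```

Because the plan is solved jointly over the whole trajectory, the reward assigned to a given step depends on every other step in the rollout, and nothing in this formulation constrains which expert frame a step is matched to in time.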

Paper Structure

This paper contains 38 sections, 17 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: An illustration of the pipeline for applying an OT-based reward in RL. In this toy example, we roll out two agents for five transition steps. Both agents start from the initial state and take the same actions $a_0$ and $a_1$ in the first two states. The two agents then take different actions $a^a_2$ and $a^b_2$, generating different trajectories $\tau^a = (o_0, a_0, o_1, a_1, o_2, a^a_2, o^a_3, a^a_3, o^a_4, a^a_4, o^a_5)$ and $\tau^b = (o_0, a_0, o_1, a_1, o_2, a^b_2, o^b_3, a^b_3, o^b_4, a^b_4, o^b_5)$. The OT rewards for $(o_0, a_0)$ and $(o_1, a_1)$ in $\tau^a$ and $\tau^b$ differ even though the state-action pairs are exactly the same.
  • Figure 2: Why could the OT reward be useful? When the OT reward is generally correct, it helps rank the goodness of different states and induces the policy to take better actions. (left) In the toy example, the two agents take different actions $a^a_2$ and $a^b_2$ at $o_2$ and thereafter. The goodness of $a^a_2$ and $a^b_2$ is measured by the OT reward computed w.r.t. the next-state observations $o^a_3$ and $o^b_3$. (right) A comparison of the true OT reward curves for trajectories $\tau^a$ and $\tau^b$, where $o_0$/$o_1$/$o_2$/$o_3$/$o_4$/$o_5$ correspond to observations at the 0/20/40/60/80/100-th step. The OT reward for trajectory $\tau^b$ is generally larger, which shows that the OT reward is generally correct.
  • Figure 3: An illustration of the proposed TemporalOT method. (left) Instead of using a pair-wise cosine similarity as the transport cost, we use a group-wise cosine similarity to learn a more accurate cost matrix. (right) We use a temporal mask to restrict the OT reward to a narrow window, avoiding potential distractions from observations outside the mask window.
  • Figure 4: Ablation for model components. Both proposed components are useful.
  • Figure 5: Influence of key parameters. A moderate context length $k_c$ or mask length $k_m$ performs best. The agent performs better with more expert demonstrations.
  • ...and 6 more figures
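
The two components described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration, not the authors' code: it assumes the group-wise cost averages the pairwise cost over $k_c$ consecutive frame pairs (with indices clamped at the trajectory end), that the temporal mask is a band $|i - j| \le k_m$ applied to the Sinkhorn kernel, and that agent and expert trajectories have equal length. All function names and default values are hypothetical.

```python
import numpy as np

def context_cost(cost, k_c=3):
    """Group-wise cost (assumed form): mean of cost[i+h, j+h] for h < k_c."""
    n, m = cost.shape
    h = np.arange(k_c)
    out = np.zeros_like(cost)
    for i in range(n):
        for j in range(m):
            ii = np.minimum(i + h, n - 1)   # clamp at the last frame
            jj = np.minimum(j + h, m - 1)
            out[i, j] = cost[ii, jj].mean()
    return out

def temporal_mask(n, m, k_m=2):
    """Band mask: agent step i may only match expert steps with |i - j| <= k_m."""
    i = np.arange(n)[:, None]
    j = np.arange(m)[None, :]
    return (np.abs(i - j) <= k_m).astype(float)

def masked_sinkhorn(cost, mask, eps=0.05, n_iters=200):
    """Sinkhorn iterations on a kernel zeroed outside the mask window."""
    n, m = cost.shape
    mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = mask * np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]

def temporal_ot_reward(cost, k_c=3, k_m=2):
    """Per-step reward from the masked plan over the group-wise cost."""
    c = context_cost(cost, k_c)
    plan = masked_sinkhorn(c, temporal_mask(*cost.shape, k_m))
    return -(plan * c).sum(axis=1)
```

Here `cost` is any non-negative pairwise cost matrix between agent and expert frames, for example a cosine distance between frame embeddings. Setting `k_c=1` and a mask wide enough to cover the whole matrix reduces the sketch to an ordinary entropy-regularized OT reward.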