Reinforcement Learning with Videos: Combining Offline Observations with Interaction

Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn

TL;DR

The paper tackles the data efficiency challenge in reinforcement learning by enabling robots to learn from videos of humans that lack action and reward annotations. It introduces RL with Videos (RLV), which uses two replay buffers, domain-invariant representations, an inverse model to infer robot actions from observations, and a simple SQIL-like reward scheme, all trained end-to-end with adversarial domain adaptation. Empirical results across suboptimal demonstrations, simulated vision-based tasks, and real-world human videos show that RLV significantly reduces required samples and can handle large domain shifts between human and robot data, often outperforming imitation-from-observation baselines. This approach demonstrates the practical potential of leveraging abundant human video data to accelerate robot learning in vision-based manipulation tasks.

Abstract

Reinforcement learning is a powerful framework for robots to acquire skills from experience, but often requires a substantial amount of online data collection. As a result, it is difficult to collect sufficiently diverse experiences that are needed for robots to generalize broadly. Videos of humans, on the other hand, are a readily available source of broad and interesting experiences. In this paper, we consider the question: can we perform reinforcement learning directly on experience collected by humans? This problem is particularly difficult, as such videos are not annotated with actions and exhibit substantial visual domain shift relative to the robot's embodiment. To address these challenges, we propose a framework for reinforcement learning with videos (RLV). RLV learns a policy and value function using experience collected by humans in combination with data collected by robots. In our experiments, we find that RLV is able to leverage such videos to learn challenging vision-based skills with less than half as many samples as RL methods that learn from scratch.

Paper Structure

This paper contains 20 sections, 4 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Reinforcement learning with videos. We study the setting where observational data is available, in the form of videos (top left). Our method can leverage such data to improve reinforcement learning by adding the videos to the replay buffer and directly performing RL on the observational data, while overcoming the challenges of unknown actions and domain shift between observation and interaction data.
  • Figure 2: Components of reinforcement learning with offline videos. Left: a batch with samples $(\mathbf{s}_{int}, \mathbf{a}_{int}, \mathbf{s}_{int}', \mathbf{r}_{int})$ is sampled from the action-conditioned replay pool, $\mathcal{D}_{int}$, and the observations are encoded into features $\mathbf{h}_{int}, \mathbf{h}_{int}'$. An inverse model is trained to predict the action $\mathbf{a}_{int}$ from the features $\mathbf{h}_{int}, \mathbf{h}_{int}'$. Middle: the inverse model is used to predict the missing actions in the offline videos, $\mathbf{\hat{a}}_{int}$, in the robot's action space, from features $(\mathbf{h}_{obs}, \mathbf{h}_{obs}')$ that were extracted from observations $(\mathbf{s}_{obs}, \mathbf{s}_{obs}')$. To obtain the missing rewards $\mathbf{\hat{r}}_{obs}$, we label the final step in the trajectory with a large reward and other steps with a small reward. Right: we use adversarial domain confusion to align the features from the action-conditioned data, $\mathbf{h}_{int}$, with the features from the action-free data, $\mathbf{h}_{obs}$. Finally, we use an off-policy reinforcement learning algorithm on the resulting batch $\left( (\mathbf{h}_{int}, \mathbf{h}_{obs}), (\mathbf{a}_{int}, \mathbf{\hat{a}}_{int}), (\mathbf{h}_{int}', \mathbf{h}_{obs}'), (\mathbf{r}_{int}, \mathbf{\hat{r}}_{obs}) \right)$. By overcoming the challenges of missing actions, rewards, and the presence of domain shift, we are able to effectively use the observation data to improve performance of a reinforcement learning agent.
  • Figure 3: Performance on Acrobot with different qualities of observation data. RLV is generally able to achieve equal or higher final rewards than the competing methods while training with fewer samples. The performance is especially notable with medium-quality observation data.
  • Figure 4: Randomly selected trajectories from the policy learned by RLV in different environments. Both trajectories were successful.
  • Figure 5: Rewards for the State Pusher, Visual Door Opening, and the Visual Pusher environments. In both simulated environments, the agent trained with RLV requires fewer samples to solve the task than conventional reinforcement learning.
  • ...and 7 more figures
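
The relabeling step described in the Figure 2 caption (inferring missing actions with an inverse model and assigning a large reward to the final step of each video) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the reward values `10.0` and `-1.0`, and the toy inverse model are all hypothetical placeholders, and the feature encoder and adversarial domain confusion are omitted entirely.

```python
from typing import Callable, List, Sequence, Tuple

Features = Tuple[float, ...]
Action = Tuple[float, ...]
# One transition: (h, a, h', r), mirroring the batch layout in Figure 2.
Transition = Tuple[Features, Action, Features, float]


def label_rewards(length: int,
                  final_reward: float = 10.0,
                  step_reward: float = -1.0) -> List[float]:
    """Large reward on the final step, small reward elsewhere.

    The numeric defaults are assumptions chosen for illustration only.
    """
    if length <= 0:
        return []
    return [step_reward] * (length - 1) + [final_reward]


def relabel_observation_trajectory(
    features: Sequence[Features],
    inverse_model: Callable[[Features, Features], Action],
    final_reward: float = 10.0,
    step_reward: float = -1.0,
) -> List[Transition]:
    """Turn encoded video frames h_0..h_T into T action- and reward-labeled transitions."""
    n = len(features) - 1
    rewards = label_rewards(n, final_reward, step_reward)
    return [
        (features[t],
         inverse_model(features[t], features[t + 1]),  # predicted action
         features[t + 1],
         rewards[t])
        for t in range(n)
    ]


def mixed_batch(interaction: Sequence[Transition],
                observation: Sequence[Transition]) -> List[Transition]:
    """Concatenate interaction and relabeled observation transitions for off-policy RL."""
    return list(interaction) + list(observation)
```

In this sketch, `inverse_model` stands in for a model that would be trained on the interaction data; any callable mapping a pair of consecutive features to an action can be passed in, and the combined batch could then be handed to an off-policy learner.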