Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning

Jingyun Yang, Max Sobol Mark, Brandon Vu, Archit Sharma, Jeannette Bohg, Chelsea Finn

TL;DR

Teaching robots new manipulation tasks with reinforcement learning is data-inefficient and demands heavy human supervision for reward specification and environment resets. The authors propose RoboFuME, which pre-trains a language-conditioned multi-task policy on diverse offline robot data using CalQL and then fine-tunes it online in a reset-free loop, with rewards supplied by a vision-language model (VLM) reward predictor. The approach combines offline-to-online calibration, task-conditioned representations, and autonomous reward labeling, and it is validated on five real-world manipulation tasks with as little as 3 hours of autonomous learning. In simulated experiments, CalQL-based fine-tuning and the VLM reward consistently outperform baselines, demonstrating improved data efficiency and robustness to distribution shifts.

Abstract

The pre-train and fine-tune paradigm in machine learning has had dramatic success in a wide range of domains because the use of existing data or pre-trained models on the Internet enables quick and easy learning of new tasks. We aim to enable this paradigm in robotic reinforcement learning, allowing a robot to learn a new task with little human effort by leveraging data and models from the Internet. However, reinforcement learning often requires significant human effort in the form of manual reward specification or environment resets, even if the policy is pre-trained. We introduce RoboFuME, a reset-free fine-tuning system that pre-trains a multi-task manipulation policy from diverse datasets of prior experiences and self-improves online to learn a target task with minimal human intervention. Our insights are to utilize calibrated offline reinforcement learning techniques to ensure efficient online fine-tuning of a pre-trained policy in the presence of distribution shifts, and to leverage pre-trained vision-language models (VLMs) to build a robust reward classifier that autonomously provides reward signals during online fine-tuning. In a diverse set of five real robot manipulation tasks, we show that our method can incorporate data from an existing robot dataset collected at a different institution and improve on a target task within as little as 3 hours of autonomous real-world experience. We also demonstrate in simulation experiments that our method outperforms prior works that use different RL algorithms or different approaches for predicting rewards. Project website: https://robofume.github.io
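
The reset-free fine-tuning loop described in the abstract can be illustrated with a small toy sketch. Everything below is hypothetical and only meant to show the control flow: `ToyEnv` is a 1-D stand-in for the robot, `reward_classifier` stands in for the VLM-based reward classifier, and `policy` stands in for the pre-trained task-conditioned policy. Alternating between a "forward" task and a "backward" (undo) task is an assumption made here as one common way to avoid manual resets; the abstract itself does not spell out this detail.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical 1-D stand-in for the robot: position is an integer in [0, goal]."""
    goal: int = 5
    pos: int = 0

    def step(self, action: int) -> int:
        # Clamp to [0, goal]. Note that there is no reset() method at all.
        self.pos = max(0, min(self.goal, self.pos + action))
        return self.pos


def reward_classifier(obs: int, task: str, goal: int) -> float:
    """Stand-in for a learned success classifier; returns a success probability."""
    target = goal if task == "forward" else 0
    return 1.0 if obs == target else 0.0


def policy(obs: int, task: str, rng: random.Random, epsilon: float = 0.2) -> int:
    """Stand-in for a pre-trained task-conditioned policy with some exploration noise."""
    greedy = 1 if task == "forward" else -1
    return -greedy if rng.random() < epsilon else greedy


def reset_free_finetune(num_episodes: int = 6, horizon: int = 20, seed: int = 0):
    rng = random.Random(seed)
    env = ToyEnv()
    replay_buffer = []
    successes = {"forward": 0, "backward": 0}
    obs = env.pos
    for episode in range(num_episodes):
        # Assumption: alternate tasks so each episode starts where the last one ended.
        task = "forward" if episode % 2 == 0 else "backward"
        for _ in range(horizon):
            action = policy(obs, task, rng)
            next_obs = env.step(action)
            # Reward comes from the classifier, not from a human or hand-coded check.
            reward = float(reward_classifier(next_obs, task, env.goal) > 0.5)
            replay_buffer.append((obs, task, action, reward, next_obs))
            obs = next_obs
            if reward == 1.0:
                successes[task] += 1
                break
        # A real system would run actor-critic updates on replay_buffer here.
    return replay_buffer, successes
```

Because the environment is never reset, consecutive transitions in the buffer chain together: the `next_obs` of one transition is the `obs` of the next, even across episode boundaries.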

Paper Structure

This paper contains 11 sections, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 2: Illustrations of the five real-world evaluation tasks: (a) sweep candies to the top of the tray; (b) fold the yellow cloth; (c) cover a red wooden cube using the cloth; (d) place the lid on top of the metallic pot; (e) move the orange pot from the sink to the drying rack.
  • Figure 3: Performance of our method on three simulated environments. We report the success rate over the course of training, averaged over three seeds. Our method, RoboFuME, consistently outperforms BC, ARIEL+VLM (Walke et al., 2023), and MEDAL++ (Sharma et al., 2023) on all three domains.
  • Figure 4: Performance of our method on the simulated Vase task with different actor-critic update objectives. Fine-tuning with CalQL is critical to obtain stable improvements on this task, as training with CQL, AWAC, or SAC yields poor performance. We also find that language-conditioned policies perform slightly better than one-hot task IDs in simulation.
  • Figure 5: Performance of our method on the simulated Vase task using different reward functions. Our method uses a fine-tuned VLM reward function and outperforms VICE rewards, whereas CNN and VIP rewards fail to improve online.
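
To make the CalQL versus CQL comparison in Figure 4 concrete, here is a minimal sketch of how the two conservative regularizers differ. It assumes the simple expectation form of each penalty: CQL pushes down Q-values at policy actions and pushes up Q-values at dataset actions, while the calibrated variant clips the pushed-down Q-values from below with a reference value (for example, a Monte Carlo return estimate of the behavior policy). The function names and example numbers are hypothetical, and practical implementations typically work on batches of network outputs rather than plain lists.

```python
from statistics import fmean


def cql_regularizer(q_policy, q_data):
    """CQL-style penalty: mean Q at policy actions minus mean Q at dataset actions."""
    return fmean(q_policy) - fmean(q_data)


def calql_regularizer(q_policy, q_data, reference_values):
    """Calibrated penalty: Q at policy actions is lower-bounded by a reference value.

    Entries already below their reference contribute the reference instead, so
    minimizing the penalty no longer pushes those Q-values down any further.
    """
    calibrated = [max(q, v) for q, v in zip(q_policy, reference_values)]
    return fmean(calibrated) - fmean(q_data)


# Hypothetical numbers for three states.
q_policy = [-3.0, 0.5, 2.0]   # Q(s, a) with a sampled from the current policy
q_data = [1.0, 1.5, 2.0]      # Q(s, a) with a taken from the offline dataset
reference = [1.0, 1.0, 1.0]   # assumed reference values, e.g. Monte Carlo returns

cql_penalty = cql_regularizer(q_policy, q_data)
calql_penalty = calql_regularizer(q_policy, q_data, reference)
```

In this toy case the first two policy Q-values sit below their reference, so only the third one is still pushed down by the calibrated penalty. Keeping pre-trained Q-values from becoming excessively pessimistic in this way is the intuition usually given for why calibration helps at the start of online fine-tuning.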