RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning

Charles Xu, Qiyang Li, Jianlan Luo, Sergey Levine

TL;DR

RLDG (Reinforcement Learning Distilled Generalists) is a method that uses specialist RL policies to autonomously generate high-quality training data for fine-tuning robotic foundation models. By training RL experts on narrowly scoped tasks and distilling their trajectories into generalist policies (OpenVLA and Octo), the approach achieves higher success rates and better generalization than fine-tuning with human demonstrations, often with 6–10x less data. Results on precise manipulation tasks (e.g., connector insertion, FMB assembly) show substantial gains in both in-distribution and unseen scenarios. Analyses indicate that the improvements stem from optimized action distributions and improved state coverage, with action quality being the dominant factor. The work demonstrates a practical, scalable way to combine task-specific RL with generalist policy distillation, yielding more capable yet flexible robotic manipulation systems that retain foundation-model advantages while approaching specialized performance.

Abstract

Recent advances in robotic foundation models have enabled the development of generalist policies that can adapt to diverse tasks. While these models show impressive flexibility, their performance heavily depends on the quality of their training data. In this work, we propose Reinforcement Learning Distilled Generalists (RLDG), a method that leverages reinforcement learning to generate high-quality training data for finetuning generalist policies. Through extensive real-world experiments on precise manipulation tasks like connector insertion and assembly, we demonstrate that generalist policies trained with RL-generated data consistently outperform those trained with human demonstrations, achieving up to 40% higher success rates while generalizing better to new tasks. We also provide a detailed analysis that reveals this performance gain stems from both optimized action distributions and improved state coverage. Our results suggest that combining task-specific RL with generalist policy distillation offers a promising approach for developing more capable and efficient robotic manipulation systems that maintain the flexibility of foundation models while achieving the performance of specialized controllers. Videos and code can be found on our project website: https://generalist-distillation.github.io
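
The data flow described in the abstract (train a specialist, roll it out to collect data, then fit a generalist on that data with supervised learning) can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D environment, the hand-written `expert_policy` standing in for a converged RL specialist, the success filter, and the one-parameter linear student are all assumptions made for illustration. The paper fine-tunes OpenVLA and Octo on real robot data.

```python
import random


def expert_policy(x):
    """Hypothetical stand-in for a converged RL specialist: step toward x = 0."""
    return max(-1.0, min(1.0, -0.5 * x))


def rollout(policy, x0, horizon=20, tol=0.05):
    """Run one episode; return the (state, action) pairs and a success flag."""
    x, traj = x0, []
    for _ in range(horizon):
        a = policy(x)
        traj.append((x, a))
        x = x + a  # toy dynamics: the action is a displacement
        if abs(x) < tol:
            return traj, True
    return traj, False


def collect_dataset(policy, n_episodes, rng):
    """Roll out the specialist and keep successful episodes only
    (the success filter is an assumption of this sketch)."""
    data = []
    for _ in range(n_episodes):
        traj, ok = rollout(policy, rng.uniform(-1.0, 1.0))
        if ok:
            data.extend(traj)
    return data


def fit_linear_policy(data):
    """Distillation step as behavior cloning: least-squares fit of a = w * x."""
    sxx = sum(x * x for x, _ in data)
    sxa = sum(x * a for x, a in data)
    w = sxa / sxx
    return lambda x: w * x


rng = random.Random(0)
dataset = collect_dataset(expert_policy, n_episodes=50, rng=rng)
student = fit_linear_policy(dataset)
_, success = rollout(student, 0.8)
print(f"{len(dataset)} transitions collected, student succeeds: {success}")
```

The point of the sketch is only the separation of roles: the specialist is optimized per task and used purely as a data generator, while the student sees nothing but the resulting (state, action) pairs.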

Paper Structure

This paper contains 35 sections, 2 equations, 8 figures.

Figures (8)

  • Figure 1: RLDG improves generalist robot policies like OpenVLA and Octo by training specialist RL policies and using them to generate high-quality fine-tuning datasets. It has the flexibility to distill knowledge from multiple RL policies trained on individual narrowly scoped tasks into a single generalist. It can also be applied to the most critical sub-task of a long-horizon manipulation task, improving the success rate at the "bottleneck" while leveraging human demonstrations on parts of the task where they suffice.
  • Figure 2: We use a Franka Emika Panda arm with a parallel jaw gripper teleoperated by a 3Dconnexion SpaceMouse device. There is a single RealSense D405 camera mounted on the robot's wrist for image observations.
  • Figure 3: Illustrations of tasks used to evaluate RLDG. (A) Precise Connector Insertion includes three training objects and four unseen test objects for evaluating policy generalization. (B) Pick and Place involves an unseen scenario that tests the policy's visual robustness to different backgrounds and objects. (C) FMB Insertion involves inserting a pre-grasped object into a moving board, while (D) FMB Assembly starts with the object on the table and involves an additional grasping phase.
  • Figure 4: Success rate comparison of OpenVLA and Octo policies fine-tuned with RLDG versus conventional methods using human demonstrations. Both generalists trained with RLDG consistently outperform their counterparts trained with the same number of successful expert human demonstrations in both training and unseen scenarios.
  • Figure 5: Success rate of OpenVLA policies fine-tuned on different sizes of RL-generated and human-collected datasets. When evaluated on seen (VGA) and unseen (Type C) Connector Insertion tasks, RLDG shows superior sample efficiency, requiring significantly fewer demonstrations to achieve a perfect success rate in both scenarios, while the performance of the conventional method saturates in the unseen case.
  • ...and 3 more figures