GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, Wanli Peng, Jingchao Qiao, Zeyu Ren, Haixin Shi, Zhi Su, Jiawen Tian, Yuyang Xiao, Shenyu Zhang, Liwei Zheng, Hang Li, Yonghui Wu

TL;DR

GR-RL tackles the gap between generalist vision-language-action policies and reliable, long-horizon dexterous manipulation by introducing a reinforcement-augmented training pipeline. It filters, augments, and reinforces demonstrations through offline progress-based filtering, symmetry-based data augmentation, and online RL with a latent-space noise predictor, all built on a Mixture-of-Transformer architecture. The method achieves 83.3% success in shoe-lacing, a long-horizon, millimeter-precision task, demonstrating the practical viability of specialized policies derived from generalist foundations. This work highlights a path toward turning broad robotic foundations into reliable, real-world experts for challenging manipulation tasks.

Abstract

We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Existing VLA policies rest on the assumption that human demonstrations are optimal. However, we argue that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations with reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress function, filters the demonstration trajectories, and keeps only the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with a sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation, which greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behavior for high-precision control, we perform online RL by learning a latent-space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.
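The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes a per-step progress estimate is already available (for example, values read off a learned critic) and keeps only the transitions whose progress increases. The function name `filter_by_progress`, the `min_gain` threshold, and the toy numbers are hypothetical; the abstract does not specify the exact filtering criterion.

```python
import numpy as np

def filter_by_progress(progress, min_gain=0.0):
    """Return indices t of transitions (t -> t+1) whose progress gain exceeds min_gain.

    `progress` is an assumed per-step progress estimate for one trajectory.
    """
    progress = np.asarray(progress, dtype=float)
    gains = np.diff(progress)  # progress[t+1] - progress[t]
    return np.flatnonzero(gains > min_gain)

# Toy trajectory: progress dips at step 2 and stalls between steps 3 and 4.
progress = [0.10, 0.20, 0.15, 0.30, 0.30, 0.50]
kept = filter_by_progress(progress)
print(kept.tolist())
```

In this toy trajectory the regressing and stalled transitions are dropped, so behavior cloning would only see the steps that moved the task forward.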

Paper Structure

This paper contains 27 sections, 4 equations, 8 figures.

Figures (8)

  • Figure 1: GR-RL performs long-horizon, dexterous, and high-precision manipulation, in the task of shoe lacing, by adopting a multi-stage training pipeline, consisting of 1) offline filtered behavior cloning with learned task progress, 2) simple yet effective action augmentation, 3) online reinforcement learning.
  • Figure 2: The GR-RL Model. GR-RL adopts a Mixture-of-Transformer (MoT) architecture. It is co-trained on robot vision-language-action trajectories via a flow-matching objective, and Temporal-Difference (TD) errors via distributional reinforcement learning.
  • Figure 3: Examples of learned task progress.
  • Figure 4: The ByteMini-v2 Robot. We show the robot specifications in terms of sensors, DoFs and electronic devices.
  • Figure 5: Left: the success rate of our multi-stage training recipe. Data filtering, mirror augmentation, and online tuning all contribute to the final performance. Right: the binary success signal per episode (dots) and the moving average of success rate (curve) during online finetuning. The performance increases rapidly after an offline-to-online adaptation phase.
  • ...and 3 more figures
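The mirror augmentation named in the captions of Figures 1 and 5 can be sketched for a bimanual setup as follows. This is an illustrative assumption rather than the authors' implementation: it supposes two wrist-camera images of shape (H, W, C) and end-effector positions (x, y, z) with y as the lateral axis. The function `mirror_sample` and all values are hypothetical.

```python
import numpy as np

def mirror_sample(left_img, right_img, left_pos, right_pos):
    """Mirror a bimanual sample across the robot's assumed sagittal (x-z) plane.

    Assumed layout: images are (H, W, C); positions are (x, y, z) with y lateral.
    """
    flip = np.array([1.0, -1.0, 1.0])  # negate only the lateral coordinate
    # The arms swap roles, and each image is flipped horizontally.
    new_left_img = right_img[:, ::-1]
    new_right_img = left_img[:, ::-1]
    new_left_pos = np.asarray(right_pos, dtype=float) * flip
    new_right_pos = np.asarray(left_pos, dtype=float) * flip
    return new_left_img, new_right_img, new_left_pos, new_right_pos

# Toy sample: small random images and two end-effector positions.
rng = np.random.default_rng(0)
left_img = rng.random((4, 6, 3))
right_img = rng.random((4, 6, 3))
left_pos = [0.30, 0.20, 0.10]
right_pos = [0.30, -0.25, 0.10]

mirrored = mirror_sample(left_img, right_img, left_pos, right_pos)
restored = mirror_sample(*mirrored)  # mirroring twice recovers the original
```

A full version would also need to mirror orientations and any left/right wording in the language instruction, which this sketch omits.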