RL-Driven Data Generation for Robust Vision-Based Dexterous Grasping

Atsushi Kanehira, Naoki Wake, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

TL;DR

The work tackles the data scarcity problem in vision-based dexterous grasping by introducing an RL-driven data-generation pipeline that produces diverse, contact-rich grasp trajectories. A parameterized grasp skill combines a reference trajectory $\xi_t(z,o)$ with a residual policy $\pi_\theta(s_t)$, yielding actions $a_t = \xi_t(z,o) + \pi_\theta(s_t)$ that adapt to object geometry described by $\phi$ (superquadrics). The generated simulations populate a dataset $\mathcal{D}$, which is used to fine-tune a vision-conditioned imitation-learning policy (Octo-Medium), demonstrating improved generalization to unseen object shapes when real and simulated data are mixed. Real-world experiments show that a mixed-data approach achieves robust 100% success across both in-distribution and out-of-distribution objects, highlighting the effectiveness of simulation-based data augmentation for scalable dexterous manipulation and the viability of real-to-sim-to-real transfer.

Abstract

This work presents reinforcement learning (RL)-driven data augmentation to improve the generalization of vision-action (VA) models for dexterous grasping. While real-to-sim-to-real frameworks, where a few real demonstrations seed large-scale simulated data, have proven effective for VA models, applying them to dexterous settings remains challenging: obtaining stable multi-finger contacts is nontrivial across diverse object shapes. To address this, we leverage RL to generate contact-rich grasping data across varied geometries. In line with the real-to-sim-to-real paradigm, the grasp skill is formulated as a parameterized and tunable reference trajectory refined by a residual policy learned via RL. This modular design enables trajectory-level control that is both consistent with real demonstrations and adaptable to diverse object geometries. A vision-conditioned policy trained on simulation-augmented data demonstrates strong generalization to unseen objects, highlighting the potential of our approach to alleviate the data bottleneck in training VA models.
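The composition of a reference trajectory and a residual correction can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear approach-and-close reference, the 3-D wrist position plus four finger joints, the `standoff` and `closed_angle` values, and the random linear map standing in for the learned policy. Only the additive rule $a_t = \xi_t(z,o) + \pi_\theta(s_t)$ comes from the source.

```python
import numpy as np

N_FINGER_JOINTS = 4  # hypothetical; chosen only to keep the example small


def reference_action(t, horizon, z, o, standoff=0.15, closed_angle=1.0):
    """Nominal action xi_t(z, o): wrist position followed by finger joint targets.

    Assumes z is a unit approach direction and o is the object position.
    """
    alpha = min(t / (horizon - 1), 1.0)           # progress along the skill, in [0, 1]
    wrist = o - (1.0 - alpha) * standoff * z      # approach the object along z
    fingers = np.full(N_FINGER_JOINTS, alpha * closed_angle)  # close gradually
    return np.concatenate([wrist, fingers])


def residual_policy(state, weights, scale=0.02):
    """Stand-in for pi_theta(s_t): a bounded linear correction, not a trained policy."""
    return scale * np.tanh(weights @ state)


def compose_action(t, horizon, z, o, state, weights):
    """a_t = xi_t(z, o) + pi_theta(s_t)."""
    return reference_action(t, horizon, z, o) + residual_policy(state, weights)


z = np.array([0.0, 0.0, -1.0])     # approach from above
o = np.array([0.5, 0.0, 0.1])      # object position
state = np.ones(6)                 # placeholder for the privileged state s_t
weights = np.random.default_rng(0).normal(size=(3 + N_FINGER_JOINTS, 6))
for t in range(0, 10, 3):
    print(t, np.round(compose_action(t, 10, z, o, state, weights), 3))
```

Bounding the residual with `tanh` keeps each action within `scale` of the reference in this sketch, which is one simple way to keep corrected trajectories close to the demonstrated motion; whether the paper constrains the residual this way is not stated in the text above.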

Paper Structure

This paper contains 16 sections, 4 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of the RL-driven data collection framework. Grasping trajectories are generated by executing a parameterized grasp skill in simulation. The skill uses a residual RL policy trained with privileged information and is conditioned on a skill parameter $z$ (e.g., approach direction). The simulated environment includes a single object, parameterized by $\phi$. Both $z$ and $\phi$, along with other simulation factors (e.g., camera pose and lighting), are sampled from distributions derived from heuristics or estimated from real demonstrations. The resulting simulated trajectories are used to train a vision-to-action imitation learning model.
  • Figure 2: Schematic illustration of grasp skill execution. The reference trajectory $\xi_t(z, o)$ (orange dots), which defines the nominal end-effector pose and finger joint configuration, is refined at each timestep by the residual policy $\pi_\theta(s_t)$ (blue dots). The final robot action $a_t = \xi_t(z, o) + \pi_\theta(s_t)$ (black) incorporates both position and finger corrections.
  • Figure 3: Visualization of object shape variation by sweeping the superquadric parameter $\varepsilon_2$ from a small positive value to 2.0. Each shape is rendered with the same scale, and all other parameters are fixed. This illustrates how geometric smoothness and edge sharpness change as $\varepsilon_2$ increases.
  • Figure 4: Real-world setup used for evaluation. A UR10e arm with a Shadow Dexterous Hand Lite performs grasping tasks while receiving RGB input from a fixed ZED2i camera.
  • Figure 5: Example simulated grasp outcomes generated by our framework across diverse object shapes, used as training data for vision-based policies.
  • ...and 2 more figures
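
The shape sweep described in the Figure 3 caption can be reproduced approximately with the standard superquadric parameterization. The Python sketch below samples surface points for several values of $\varepsilon_2$ and checks them against the usual inside-outside function. The half-axis lengths, the fixed $\varepsilon_1 = 1$, the lower bound of 0.1 for the sweep, and the function names are assumptions for illustration; the paper's exact parameter ranges for $\phi$ are not given in the text above.

```python
import numpy as np


def _signed_pow(base, exponent):
    """sign(base) * |base|**exponent, the signed power used in superquadrics."""
    return np.sign(base) * np.abs(base) ** exponent


def superquadric_points(scale, eps1, eps2, n_eta=16, n_omega=32):
    """Sample points on a superquadric surface with half-axes scale = (a1, a2, a3)."""
    a1, a2, a3 = scale
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)               # latitude
    omega = np.linspace(-np.pi, np.pi, n_omega, endpoint=False)   # longitude
    eta, omega = np.meshgrid(eta, omega, indexing="ij")
    x = a1 * _signed_pow(np.cos(eta), eps1) * _signed_pow(np.cos(omega), eps2)
    y = a2 * _signed_pow(np.cos(eta), eps1) * _signed_pow(np.sin(omega), eps2)
    z = a3 * _signed_pow(np.sin(eta), eps1)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)


def inside_outside(points, scale, eps1, eps2):
    """Implicit function F: F < 1 inside, F = 1 on the surface, F > 1 outside."""
    x, y, z = np.abs(points / np.asarray(scale)).T
    xy = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy + z ** (2.0 / eps1)


scale = (0.04, 0.04, 0.08)                # hypothetical half-axes in metres
for eps2 in np.linspace(0.1, 2.0, 5):     # sweep from box-like to pinched cross-sections
    pts = superquadric_points(scale, eps1=1.0, eps2=eps2)
    residual = np.max(np.abs(inside_outside(pts, scale, 1.0, eps2) - 1.0))
    print(f"eps2={eps2:.3f}  points={len(pts)}  max|F-1|={residual:.1e}")
```

In this parameterization, small $\varepsilon_2$ gives a nearly square horizontal cross-section, $\varepsilon_2 = 1$ gives an ellipse, and $\varepsilon_2 = 2$ gives a diamond-shaped one, which matches the change in edge sharpness that the caption describes.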