RL-Driven Data Generation for Robust Vision-Based Dexterous Grasping
Atsushi Kanehira, Naoki Wake, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
TL;DR
The work tackles the data scarcity problem in vision-based dexterous grasping by introducing an RL-driven data-generation pipeline that produces diverse, contact-rich grasp trajectories. A parameterized grasp skill combines a reference trajectory $\xi_t(z,o)$ with a residual policy $\pi_\theta(s_t)$, yielding actions $a_t = \xi_t(z,o) + \pi_\theta(s_t)$ that adapt to object geometry described by $\phi$ (superquadrics). The generated simulations populate a dataset $\mathcal{D}$, which is used to fine-tune a vision-conditioned imitation-learning policy (Octo-Medium), demonstrating improved generalization to unseen object shapes when real and simulated data are mixed. Real-world experiments show that a mixed-data approach achieves robust 100% success across both in-distribution and out-of-distribution objects, highlighting the effectiveness of simulation-based data augmentation for scalable dexterous manipulation and the viability of real-to-sim-to-real transfer.
Abstract
This work presents reinforcement learning (RL)-driven data augmentation to improve the generalization of vision-action (VA) models for dexterous grasping. While real-to-sim-to-real frameworks, where a few real demonstrations seed large-scale simulated data, have proven effective for VA models, applying them to dexterous settings remains challenging: obtaining stable multi-finger contacts is nontrivial across diverse object shapes. To address this, we leverage RL to generate contact-rich grasping data across varied geometries. In line with the real-to-sim-to-real paradigm, the grasp skill is formulated as a parameterized and tunable reference trajectory refined by a residual policy learned via RL. This modular design enables trajectory-level control that is both consistent with real demonstrations and adaptable to diverse object geometries. A vision-conditioned policy trained on simulation-augmented data demonstrates strong generalization to unseen objects, highlighting the potential of our approach to alleviate the data bottleneck in training VA models.
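To make the "reference trajectory refined by a residual policy" structure concrete, the following is a minimal sketch of how an action could be composed as the sum of a reference term and a learned correction. It is illustrative only and is not the authors' implementation. The function names, the linear-interpolation reference, the reading of z as skill parameters and o as an object-dependent offset, and the fixed linear map standing in for the RL-trained residual are all assumptions introduced here.

```python
from typing import List, Sequence

Vector = List[float]


def reference_trajectory(t: int, horizon: int,
                         z: Sequence[float], o: Sequence[float]) -> Vector:
    """Hypothetical reference xi_t(z, o).

    Assumption: the hand starts at the zero configuration and moves linearly
    toward a target (z + o), where z are tunable skill parameters and o is an
    object-dependent offset. The actual parameterization may differ.
    """
    alpha = min(max(t / horizon, 0.0), 1.0)  # progress clipped to [0, 1]
    return [alpha * (zi + oi) for zi, oi in zip(z, o)]


def residual_policy(state: Sequence[float], gain: float = 0.1) -> Vector:
    """Stand-in for the residual pi_theta(s_t).

    Assumption: a fixed linear map replaces the RL-trained network so that
    the sketch runs without any learning framework.
    """
    return [-gain * s for s in state]


def compose_action(t: int, horizon: int, z: Sequence[float],
                   o: Sequence[float], state: Sequence[float]) -> Vector:
    """Action a_t = reference term + residual correction, element-wise."""
    ref = reference_trajectory(t, horizon, z, o)
    res = residual_policy(state)
    return [r + d for r, d in zip(ref, res)]


if __name__ == "__main__":
    # Two joints, halfway through a 10-step grasp, zero object offset.
    action = compose_action(t=5, horizon=10, z=[1.0, 2.0],
                            o=[0.0, 0.0], state=[0.5, -0.5])
    print(action)
```

In this arrangement the reference term carries the overall shape of the motion, while the residual only has to supply small corrections. That is one plausible reading of why the modular design can stay consistent with real demonstrations while adapting to different object geometries.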
