Dexterous Pre-grasp Manipulation for Human-like Functional Categorical Grasping: Deep Reinforcement Learning and Grasp Representations
Dmytro Pavlichenko, Sven Behnke
TL;DR
The paper tackles dexterous pre-grasp manipulation to achieve functional grasps of human-oriented tools using a single, data-driven DRL policy trained from scratch. It introduces two grasp representations (explicit target grasps and more abstract constraint-based targets) and a dense, multi-component reward that guides learning without demonstrations, coupled with a curriculum to speed convergence. Across drills, spray bottles, and mugs, the approach achieves high success rates on both seen and unseen instances (roughly 94% on test instances for explicit targets and about 90% for constraint-based targets), and the policy autonomously learns human-like strategies such as repositioning, reorienting, and up-righting objects. The work also discusses limitations for real-world transfer and proposes a three-step path to practical deployment: distillation of privileged observations, occlusion-aware rewards, and sim-to-real fine-tuning.
Abstract
Many objects, such as tools and household items, can be used only if grasped in a very specific way, that is, grasped functionally. Often, however, a direct functional grasp is not possible. We propose a method for learning a dexterous pre-grasp manipulation policy that achieves human-like functional grasps using deep reinforcement learning. We introduce a dense multi-component reward function that enables learning a single policy capable of dexterous pre-grasp manipulation of novel instances of several known object categories with an anthropomorphic hand. The policy is learned from scratch purely by reinforcement learning, without any expert demonstrations. It implicitly learns to reposition and reorient objects of complex shapes to achieve given functional grasps. In addition, we explore two ways to represent a desired grasp: an explicit one and a more abstract, constraint-based one. We show that our method consistently learns to manipulate previously unseen object instances of known categories and to achieve the desired grasps with both grasp representations. Training is completed on a single GPU in under three hours.
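To make the distinction between the two grasp representations and the idea of a dense multi-component reward more concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the reward or the representations used in the paper: the class names `ExplicitGrasp` and `ConstraintGrasp`, the exponential shaping function, the choice of reward terms, and all weights and scales are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ExplicitGrasp:
    """Hypothetical explicit target: desired palm position in the object
    frame plus desired finger joint angles."""
    palm_pos: tuple      # (x, y, z) in metres
    joint_angles: tuple  # radians


@dataclass
class ConstraintGrasp:
    """Hypothetical constraint-based target: points on the object that
    named fingertips should reach, with no full hand pose prescribed."""
    fingertip_targets: dict  # finger name -> (x, y, z) in the object frame


def shaped(distance, scale=10.0):
    """Map a non-negative error to (0, 1]; 1 means the term is satisfied.
    The exponential form and the scale are assumptions."""
    return math.exp(-scale * distance)


def explicit_reward(grasp, palm_pos, joint_angles, weights=(0.6, 0.4)):
    """Dense reward as a weighted sum of a palm-position term and a
    joint-configuration term (weights are arbitrary)."""
    pos_term = shaped(math.dist(grasp.palm_pos, palm_pos))
    joint_err = sum(
        abs(a - b) for a, b in zip(grasp.joint_angles, joint_angles)
    ) / len(grasp.joint_angles)
    joint_term = shaped(joint_err, scale=2.0)
    return weights[0] * pos_term + weights[1] * joint_term


def constraint_reward(grasp, fingertip_pos):
    """Dense reward as the mean of per-constraint terms, one for each
    fingertip that has a target point."""
    terms = [
        shaped(math.dist(target, fingertip_pos[name]))
        for name, target in grasp.fingertip_targets.items()
    ]
    return sum(terms) / len(terms)


# Usage: both rewards increase smoothly as the hand approaches the target.
target = ExplicitGrasp(palm_pos=(0.0, 0.0, 0.1), joint_angles=(0.5, 0.5))
far = explicit_reward(target, (0.2, 0.0, 0.1), (0.0, 0.0))
near = explicit_reward(target, (0.01, 0.0, 0.1), (0.45, 0.5))

trigger = ConstraintGrasp(fingertip_targets={"index": (0.02, 0.0, 0.05)})
on_trigger = constraint_reward(trigger, {"index": (0.02, 0.0, 0.05)})
```

In this sketch every term is dense, so the agent receives a learning signal at every step rather than only when the grasp is achieved. The explicit variant prescribes a full hand configuration, while the constraint-based variant only scores the fingertips that carry a constraint and leaves the rest of the hand free.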
