DEFT: Dexterous Fine-Tuning for Real-World Hand Policies

Aditya Kannan, Kenneth Shaw, Shikhar Bahl, Pragna Mannam, Deepak Pathak

TL;DR

DEFT tackles the data inefficiency of real-world dexterous manipulation by combining affordance priors derived from human videos with online, CEM-based fine-tuning on a soft robotic hand. An affordance module predicts contact location, wrist pose, and post-contact hand configuration from internet videos and language cues; these predictions are then refined by a residual policy and a conditional VAE to generalize across objects. Across nine diverse tabletop tasks, DEFT adapts rapidly in the real world (often in under an hour per task) and outperforms zero-shot baselines, with ablations supporting the importance of the priors and of residual learning. Limitations include limited grasp diversity caused by perception noise, the need for human resets, and hardware limits on finger curl; addressing these could further broaden the system's dexterous capabilities. Overall, DEFT provides a practical pathway to data-efficient, real-world dexterous manipulation using video-informed priors and online fine-tuning.

Abstract

Dexterity is often seen as a cornerstone of complex manipulation. Humans are able to perform a host of skills with their hands, from making food to operating tools. In this paper, we investigate these challenges, especially in the case of soft, deformable objects as well as complex, relatively long-horizon tasks. However, learning such behaviors from scratch can be data inefficient. To circumvent this, we propose a novel approach, DEFT (DExterous Fine-Tuning for Hand Policies), that leverages human-driven priors, which are executed directly in the real world. In order to improve upon these priors, DEFT involves an efficient online optimization procedure. With the integration of human-based learning and online fine-tuning, coupled with a soft robotic hand, DEFT demonstrates success across various tasks, establishing a robust, data-efficient pathway toward general dexterous manipulation. Please see our website at https://dexterous-finetuning.github.io for video results.
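
The online optimization procedure is described in the TL;DR as CEM-based (cross-entropy method) fine-tuning on top of a prior. The sketch below illustrates only the general CEM loop under stated assumptions and is not the authors' implementation: the three-dimensional parameter vector, the `toy_reward` function, the hidden target, and the population, elite, and standard-deviation settings are all hypothetical. On the real robot, each reward evaluation would correspond to an executed rollout rather than a function call.

```python
import random
import statistics


def cem_finetune(prior, reward_fn, iterations=30, population=16,
                 n_elite=4, init_std=0.05, min_std=1e-3, seed=0):
    """Cross-entropy method over a residual added to a fixed prior.

    All default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(prior)
    mean = [0.0] * dim        # zero residual: early samples stay near the prior
    std = [init_std] * dim
    for _ in range(iterations):
        # Sample candidate residuals from the current diagonal Gaussian.
        candidates = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                      for _ in range(population)]
        # Rank candidates by the reward of (prior + residual), best first.
        ranked = sorted(
            candidates,
            key=lambda res: reward_fn([p + r for p, r in zip(prior, res)]),
            reverse=True,
        )
        elites = ranked[:n_elite]
        # Refit the sampling distribution to the elite set.
        mean = [statistics.fmean([e[d] for e in elites]) for d in range(dim)]
        std = [max(statistics.pstdev([e[d] for e in elites]), min_std)
               for d in range(dim)]
    return [p + m for p, m in zip(prior, mean)]


def toy_reward(params, target=(0.30, -0.10, 0.20)):
    """Hypothetical stand-in for a rollout reward: negative squared error."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


prior = [0.25, -0.05, 0.25]   # hypothetical output of an affordance prior
tuned = cem_finetune(prior, toy_reward)
print("prior reward:", toy_reward(prior))
print("tuned reward:", toy_reward(tuned))
```

Starting the residual at zero means the first samples stay close to the prior's prediction, which is one plausible reason to optimize a residual rather than the raw parameters when each evaluation is a costly real-world trial.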

Paper Structure (25 sections, 1 equation, 8 figures, 4 tables, 1 algorithm)

Figures (8)

  • Figure 1: We present DEFT, a novel approach that can learn complex, dexterous tasks in the real world in an efficient manner. DEFT manipulates tools and soft objects without any robot demonstrations.
  • Figure 2: Left: DEFT consists of two phases: an affordance model that predicts grasp parameters followed by online fine-tuning with CEM. Right: Our affordance prediction setup predicts grasp location and pose.
  • Figure 3: We produce three priors from human videos: the contact location (top row) and grasp pose (middle row) from the affordance prior; the post-grasp trajectory (bottom row) from a human demonstration of the task.
  • Figure 4: Left: Workspace Setup. We place an Intel RealSense camera above the robot to maintain an egocentric viewpoint, consistent with the affordance model's training data. Right: Thirteen objects used in our experiments.
  • Figure 5: Qualitative results showing the finetuning procedure for DEFT. The model learns to hold the spatula and flip the bagel after 30 CEM iterations.
  • ...and 3 more figures