HAND Me the Data: Fast Robot Adaptation via Hand Path Retrieval

Matthew Hong, Anthony Liang, Kevin Kim, Harshitha Rajaprakash, Jesse Thomason, Erdem Bıyık, Jesse Zhang

TL;DR

This work tackles rapid robot adaptation to new tasks using only a single human hand demonstration and a task-agnostic robot play dataset. It introduces HAND, a two-stage retrieval framework that uses a visual filter and 2D hand paths to retrieve relevant robot sub-trajectories, followed by parameter-efficient policy fine-tuning with LoRA adapters. The approach learns a new task in under four minutes on real robots and outperforms retrieval baselines by over 2x in average task success rate, even with hand demonstrations from unseen scenes and camera angles. The findings highlight the practicality of hand-path retrieval for scalable, data-efficient robot learning in human-centric settings, with implications for non-expert users and rapid task deployment.
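For readers unfamiliar with LoRA, the following is a minimal NumPy sketch of a low-rank adapter on a single linear layer. It illustrates the general technique only and is not the paper's policy code: the class name `LoRALinear`, the `rank` and `alpha` defaults, and the initialization scale are all assumptions chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Linear layer with a frozen weight W and a trainable low-rank update.

    Forward pass: y = x W^T + (alpha / rank) * x A^T B^T
    Only A and B would be updated during fine-tuning (hypothetical setup).
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        # A gets a small random init; B starts at zero so the adapter is
        # initially a no-op and the layer reproduces the pretrained output.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        # Parameters in the adapter only; the frozen weight is excluded.
        return self.A.size + self.B.size


# Example: a 64x64 layer has 4096 frozen weights but only 512 adapter weights.
layer = LoRALinear(np.zeros((64, 64)), rank=4)
print(layer.num_trainable())
```

Because the adapter has far fewer parameters than the layer it modifies, fine-tuning touches only a small fraction of the weights, which is consistent with the short adaptation times the summary reports.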

Abstract

We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval/.
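The two-stage retrieval described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: cosine similarity stands in for the visual filter, plain dynamic time warping (DTW) stands in for the path distance, paths are made relative to their first point, and all function names and the `keep_visual` and `top_k` parameters are hypothetical.

```python
import numpy as np


def cosine_similarity(query, candidates):
    """Cosine similarity between one feature vector and each candidate row."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q


def relative_path(points):
    """Shift a 2D path so it starts at the origin (assumed normalization)."""
    points = np.asarray(points, dtype=float)
    return points - points[0]


def dtw_distance(path_a, path_b):
    """Plain DTW cost between two 2D paths using Euclidean step costs."""
    n, m = len(path_a), len(path_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(path_a[i - 1] - path_b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def retrieve(hand_feature, hand_path, robot_features, robot_paths,
             keep_visual=2, top_k=1):
    """Return indices of robot sub-trajectories that best match the hand demo."""
    # Stage 1: keep only the most visually similar candidates.
    sims = cosine_similarity(np.asarray(hand_feature, dtype=float),
                             np.asarray(robot_features, dtype=float))
    shortlist = np.argsort(-sims)[:keep_visual]
    # Stage 2: rank the shortlist by distance between 2D motion paths.
    query = relative_path(hand_path)
    dists = [dtw_distance(query, relative_path(robot_paths[i])) for i in shortlist]
    order = np.argsort(dists)[:top_k]
    return [int(shortlist[i]) for i in order]


# Toy data: candidate 2 has the right motion but is visually dissimilar,
# so the visual filter removes it before path matching.
hand_feature = [1.0, 0.0]
hand_path = [[0, 0], [1, 0], [2, 0]]
robot_features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
robot_paths = [
    [[5, 5], [6, 5], [7, 5]],  # same rightward motion, different start
    [[0, 0], [0, 1], [0, 2]],  # upward motion
    [[0, 0], [1, 0], [2, 0]],  # rightward motion, filtered out in stage 1
]
best = retrieve(hand_feature, hand_path, robot_features, robot_paths)
```

Ordering the stages this way means the more expensive path comparison runs only on candidates that already look like the target scene.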

Paper Structure

This paper contains 11 sections, 2 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 2: HAND enables fast adaptation to a new target task by using an easy-to-provide hand demonstration of the target task (Left). We propose a two-step retrieval procedure where we first filter the trajectories in the offline play dataset, $\mathcal{D}_\text{play}$, for visually similar trajectories based on features from a pretrained vision model. We use off-the-shelf, pretrained hand detection and point tracking to construct 2D paths of the motion for both the human hand and robot end-effector. We use the distance between these paths as the metric for retrieving relevant trajectories from the play dataset (Middle), which are then used to quickly fine-tune a pretrained transformer policy on the target task (Right).
  • Figure 3: WidowX Robot Arm Setup. We evaluate the scalability of HAND on 10 manipulation tasks on a WidowX robot arm in a kitchen setup (Walke et al., 2023).
  • Figure 4: Qualitative retrieval results on an out-of-distribution (OOD) scene. We visualize the top sub-trajectory match of Flow, STRAP, HAND without visual filtering (HAND(-VF)), and HAND on two OOD demonstrations recorded with an iPhone camera, showing a hand approaching a K-Cup and putting it into the machine. Only HAND's top match is relevant for both hand demonstrations.
  • Figure 5: Real-Robot Results. Task completion scores out of 10, with partial completion counted, for $\pi_\text{base}$, STRAP, Flow, and HAND.
  • Figure 6: Fast Adaptation Study. We conduct a small-scale user study to demonstrate HAND's ability to learn robot policies in real-time. From providing the hand demonstration (Left), to retrieval and fine-tuning a base policy (Middle), to evaluating the policy (Right), we show HAND can learn to put a carrot in the blender with 7.5/10 task completion in less than 4 minutes.