ScrewMimic: Bimanual Imitation from Human Videos with Screw Space Projection

Arpit Bahety; Priyanka Mandikal; Ben Abbatematteo; Roberto Martín-Martín

TL;DR

ScrewMimic tackles bimanual manipulation by projecting human demonstrations into a screw-axis space that jointly constrains both hands. It introduces screw actions $\sigma=(g_l,g_r,S,\tau_l)$ and leverages a three-module pipeline—perception of demonstrations, point-cloud-based screw-action prediction, and self-supervised CEM-based fine-tuning—to learn from a single video. Empirical results across six real tasks show high success rates and improved generalization, with ablations confirming the benefits of the screw representation and autonomous reward signals. This approach offers a practical, scalable path for robots to acquire complex coordinated manipulation skills from human videos with minimal human supervision.
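To make the screw action $\sigma=(g_l,g_r,S,\tau_l)$ more concrete, the following is a minimal sketch of how such an action could be stored and how a point could be moved along a screw axis. It is an illustration under stated assumptions, not the authors' implementation: the class names `ScrewAxis` and `ScrewAction`, the parameterization of $S$ as a point, a unit direction, and a pitch, and the reading of $\tau_l$ as a list of left-hand positions are all hypothetical choices made here.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ScrewAxis:
    """Hypothetical parameterization of S (an assumption, not from the paper)."""
    point: Vec3              # any point q on the axis
    direction: Vec3          # unit direction s of the axis
    pitch: float = 0.0       # translation along s per radian of rotation
    prismatic: bool = False  # True: pure translation along s


@dataclass
class ScrewAction:
    """Hypothetical container for sigma = (g_l, g_r, S, tau_l)."""
    grasp_left: Vec3             # g_l
    grasp_right: Vec3            # g_r
    axis: ScrewAxis              # S
    left_trajectory: List[Vec3]  # tau_l, assumed here to be left-hand positions


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def screw_displacement(p: Vec3, axis: ScrewAxis, theta: float) -> Vec3:
    """Move point p along the screw axis.

    theta is an angle in radians for a revolute or helical axis and a
    distance for a prismatic axis.
    """
    s = axis.direction
    if axis.prismatic:
        return tuple(p[i] + theta * s[i] for i in range(3))
    # Rodrigues' rotation formula applied to the vector from the axis point to p.
    r = tuple(p[i] - axis.point[i] for i in range(3))
    c, sn = math.cos(theta), math.sin(theta)
    s_cross_r = _cross(s, r)
    s_dot_r = _dot(s, r)
    rotated = tuple(r[i] * c + s_cross_r[i] * sn + s[i] * s_dot_r * (1.0 - c)
                    for i in range(3))
    # Add the axis point back, plus the pitch-induced slide along the axis.
    return tuple(axis.point[i] + rotated[i] + axis.pitch * theta * s[i]
                 for i in range(3))


# Usage: sweep a right-hand grasp point a quarter turn around a vertical axis.
lid_axis = ScrewAxis(point=(0.0, 0.0, 0.0), direction=(0.0, 0.0, 1.0))
waypoints = [screw_displacement((0.05, 0.0, 0.1), lid_axis, k * math.pi / 8)
             for k in range(5)]
```

The appeal of this representation is that one scalar (`theta`) drives the whole relative motion, so the search space for imitation and fine-tuning is much smaller than two independent 6-DoF arm trajectories.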

Abstract

Bimanual manipulation is a longstanding challenge in robotics due to the large number of degrees of freedom and the strict spatial and temporal synchronization required to generate meaningful behavior. Humans learn bimanual manipulation skills by watching other humans and by refining their abilities through play. In this work, we aim to enable robots to learn bimanual manipulation behaviors from human video demonstrations and fine-tune them through interaction. Inspired by seminal work in psychology and biomechanics, we propose modeling the interaction between two hands as a serial kinematic linkage, in particular as a screw motion, which we use to define a new action space for bimanual manipulation: screw actions. We introduce ScrewMimic, a framework that leverages this novel action representation to facilitate learning from human demonstration and self-supervised policy fine-tuning. Our experiments demonstrate that ScrewMimic is able to learn several complex bimanual behaviors from a single human video demonstration, and that it outperforms baselines that interpret demonstrations and fine-tune directly in the original space of motion of both arms. For more information and video results, see https://robin-lab.cs.utexas.edu/ScrewMimic/
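The fine-tuning through interaction mentioned in the abstract is described in the TL;DR as CEM-based. As a rough illustration of the cross-entropy method in general, the sketch below refines a small parameter vector (for example, a correction to a screw axis) by repeatedly sampling candidates, scoring them, and refitting a Gaussian to the best ones. The function name `cem`, its defaults, and the toy `reward_fn` are assumptions made for this example; on a robot the reward would come from executing each candidate, and the paper's actual reward and hyperparameters are not reproduced here.

```python
import random
import statistics
from typing import Callable, List, Sequence


def cem(reward_fn: Callable[[Sequence[float]], float],
        mean: Sequence[float],
        std: Sequence[float],
        iterations: int = 20,
        population: int = 32,
        elite_frac: float = 0.25,
        seed: int = 0) -> List[float]:
    """Generic cross-entropy method with a diagonal Gaussian (illustrative only)."""
    rng = random.Random(seed)
    n_elite = max(2, int(population * elite_frac))
    mean, std = list(mean), list(std)
    for _ in range(iterations):
        # Sample candidate parameter vectors around the current estimate.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the highest-reward candidates (the "elites").
        samples.sort(key=reward_fn, reverse=True)
        elites = samples[:n_elite]
        # Refit the sampling distribution to the elites, one dimension at a time.
        for d in range(len(mean)):
            column = [e[d] for e in elites]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column) + 1e-6  # avoid zero variance
    return mean


# Usage with a toy reward: negative squared distance to a hidden "true" offset.
hidden_offset = (0.3, -0.2, 0.1)


def toy_reward(x: Sequence[float]) -> float:
    return -sum((a - b) ** 2 for a, b in zip(x, hidden_offset))


refined = cem(toy_reward, mean=[0.0, 0.0, 0.0], std=[0.5, 0.5, 0.5])
```

Because the search runs over a handful of screw parameters rather than full arm trajectories, a sampling-based optimizer of this kind needs comparatively few evaluations per iteration.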

Paper Structure (17 sections, 3 equations, 11 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Bimanual Manipulation
Visual Imitation Learning
Preliminaries: Screw Theory
ScrewMimic: Policy Learning with Screw Actions
Extracting a Screw Action from a Human Demonstration
Predicting a Screw Action from a Point Cloud
Self-Supervised Screw-Action Policy Fine-Tuning
Experimental Evaluation
Lessons and Conclusion
Screw Action with Left-Hand Trajectory
Hyperparameters
Using Pretrained PointNet Model
Robustness to Noisy Demonstrations
...and 2 more sections

Figures (11)

Figure 1: Bimanual manipulation tasks can be represented by a screw axis (red line) constraining and synchronizing the motion of both hands. ScrewMimic maps a single human demonstration into a screw axis, improves it with an iterative interactive exploration procedure, and learns to predict it for new object instances and poses, enabling their manipulation.
Figure 2: Overview of ScrewMimic. a) Given an RGB-D video of a human performing a bimanual task, we use off-the-shelf hand tracking (HT) models [rong2021frankmocap, shan2020] to extract a trajectory of wrist poses $\tau^h$ and grasp contact points $(g_l^h,g_r^h)$. ScrewMimic interprets $\tau^h$ as a screw motion between both hands to estimate screw axis parameters $S^h$ (see "Extracting a Screw Action from a Human Demonstration"). b) Next, we apply geometric augmentations on the 3D object point cloud to train a PointNet [qi2016pointnet] model to estimate screw actions for novel object views (see "Predicting a Screw Action from a Point Cloud"). c) Finally, the trained model generates an initial hypothesis that the robot executes and iteratively refines using an autonomously generated reward signal. The successful data point is further used to improve the prediction model (see "Self-Supervised Screw-Action Policy Fine-Tuning").
Figure 3: Human demonstrations as screw actions. Three frames of a human demonstration for three bimanual tasks (top row: opening a bottle, $m=$revolute; middle row: stirring a pot, $m=$revolute3D; bottom row: opening a zipper, $m=$prismatic) and the perceived screw axis explaining the motion (fourth column, orange indicates the axis line). Our screw action representation facilitates the interpretation of noisy hand trajectory observations in a bimanual interaction as evidence of a simple 1-DoF constraint between both hands.
Figure 4: Screw Action Fine-tuning and Prediction Model Re-training Result. The first column shows the human demonstration for each task. The second column shows the axis predicted by M1, the model trained on the axis extracted from the human demonstration, with the object at a novel pose. Columns 3-5 show snapshots from an episode in the fine-tuning stage. Column 6 shows the axis corresponding to the successful trajectory obtained during fine-tuning. Column 7 shows the axis predicted for a novel object pose by the prediction model re-trained on the corrected axis. This result shows how the robot starts from a noisy screw axis and corrects it through screw action fine-tuning. It also shows that the corrected axis can be used to re-train the prediction model to output a more accurate axis.
Figure 5: Generalization to New Objects. The first column shows the axis predicted by M2, the model trained on the corrected screw action for the first object. Columns 2-4 show snapshots from an episode in the fine-tuning stage. Column 5 shows the axis corresponding to the successful trajectory obtained during the aforementioned process. Column 6 shows the predicted axis from the prediction model re-trained on the corrected axis (M3). Thus, ScrewMimic can obtain reasonable screw action predictions and fine-tune them to generalize to new objects.
...and 6 more figures
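
The geometric augmentation step mentioned in the Figure 2(b) caption can be sketched as follows. The key point is that the screw-axis label must be transformed together with the point cloud so that the two stay consistent. This is a simplified illustration, assuming augmentations are rotations about the vertical axis plus a translation; the helper names `rotate_z` and `augment` are hypothetical, and the paper's actual augmentation set may differ.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rotate_z(v: Sequence[float], angle: float) -> Vec3:
    """Rotate a 3D vector about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])


def augment(points: List[Vec3], axis_point: Vec3, axis_dir: Vec3,
            angle: float, shift: Vec3):
    """Apply one rigid transform to a point cloud and its screw-axis label."""
    new_points = [tuple(r + t for r, t in zip(rotate_z(p, angle), shift))
                  for p in points]
    # A point on the axis is transformed like any other point.
    new_axis_point = tuple(r + t for r, t in zip(rotate_z(axis_point, angle), shift))
    # The axis direction only rotates; translating a direction has no meaning.
    new_axis_dir = rotate_z(axis_dir, angle)
    return new_points, new_axis_point, new_axis_dir


# Usage: turn a tiny cloud and its axis by 90 degrees and lift it by 1 unit.
cloud = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.5)]
aug_cloud, aug_q, aug_s = augment(cloud, axis_point=(1.0, 0.0, 0.0),
                                  axis_dir=(1.0, 0.0, 0.0),
                                  angle=math.pi / 2, shift=(0.0, 0.0, 1.0))
```

Training pairs generated this way let a single demonstration cover many object poses, which is consistent with the caption's goal of predicting screw actions for novel object views.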
