Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics

Hendrik Chiche; Antoine Jamme; Trevor Rigoberto Martinez

Abstract

Teleoperation of low-cost robotic manipulators remains challenging due to the complexity of mapping human hand articulations to robot joint commands. We present an offline hand-shadowing and retargeting pipeline from a single egocentric RGB-D camera mounted on 3D-printed glasses. The pipeline detects 21 landmarks per hand using MediaPipe Hands, deprojects them into 3D via depth sensing, transforms them into the robot coordinate frame, and solves a damped-least-squares inverse kinematics problem in PyBullet to produce joint commands for the 6-DOF SO-ARM101 robot. A gripper controller maps thumb-index finger geometry to grasp aperture with a four-level fallback hierarchy. Actions are first previewed in a physics simulation before replay on the physical robot through the LeRobot framework. We evaluate the IK retargeting pipeline on a structured pick-and-place benchmark (5-tile grid, 10 grasps per tile), achieving a 90% success rate, and compare it against four vision-language-action policies (ACT, SmolVLA, pi0.5, GR00T N1.5) trained on leader-follower teleoperation data. We also test the IK pipeline in unstructured real-world environments (grocery store, pharmacy), where hand occlusion by surrounding objects reduces success to 9.3% (N=75), highlighting both the promise and current limitations of marker-free analytical retargeting.
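
The geometric core of the pipeline described above (deprojection, frame transform, damped-least-squares IK) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper solves IK in PyBullet for the SO-ARM101, whereas the sketch below uses a hypothetical planar two-link arm, and the intrinsics (fx, fy, cx, cy), the transform T_robot_cam, the link lengths, and the damping value are all illustrative assumptions.

```python
import numpy as np

def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame
    using an ideal pinhole model (no lens distortion; an assumption)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def to_robot_frame(p_cam, T_robot_cam):
    """Apply a 4x4 homogeneous camera-to-robot transform to a 3D point."""
    return (T_robot_cam @ np.append(p_cam, 1.0))[:3]

def dls_step(J, err, damping=0.05):
    """One damped-least-squares update: dq = J^T (J J^T + lambda^2 I)^-1 err."""
    JJt = J @ J.T
    return J.T @ np.linalg.solve(JJt + damping**2 * np.eye(JJt.shape[0]), err)

# Hypothetical planar two-link arm, used only to exercise dls_step.
def planar_fk(q, l1=0.12, l2=0.12):
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def planar_jacobian(q, l1=0.12, l2=0.12):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [l1 * c1 + l2 * c12, l2 * c12]])

def solve_ik(target, q0, iters=200, damping=0.05):
    """Iterate DLS updates from the initial guess q0 toward the target."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = target - planar_fk(q)
        q += dls_step(planar_jacobian(q), err, damping)
    return q
```

The damping term keeps the update bounded near singular configurations, at the cost of slower convergence; in practice its value would be tuned to the arm and the workspace.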

Paper Structure (33 sections, 12 equations, 9 figures, 5 tables)

Introduction
Related Work
Hand Pose Estimation
Teleoperation Systems
Imitation Learning
Low-Cost Robotics
Physics Simulation
System Overview
Hardware
Methods
RGB-D Capture and Camera Model
Hand Pose Estimation
Depth-Based 3D Reconstruction
Camera-to-Robot Coordinate Transform
Target Pose Computation
...and 18 more sections

Figures (9)

Figure 1: System architecture. The six-stage pipeline converts egocentric RGB-D observations into robot joint commands. Stages 1–5 run sequentially; the output drives either the PyBullet simulation for preview or the physical SO-ARM101 for deployment.
Figure 2: Lab bench setup. The SO-ARM101 robot arm is mounted on a wooden base with the Intel RealSense D400 camera on a stand above, oriented in the same egocentric direction as the glasses-mounted camera.
Figure 3: PyBullet simulation preview. Top-left: RGB frame from the egocentric camera showing the operator's hand. Top-right: depth colour map. Bottom: robot arm in PyBullet with debug joint labels and IK target markers (green/red spheres), tracking the hand-derived trajectory.
Figure 4: Hand shadowing. Left: the operator wearing the RealSense glasses performs a grasp. Right: the SO-ARM101 robot mirrors the hand pose through the IK pipeline.
Figure 5: Pick-and-place benchmark task. The SO-ARM101 robot grasps the purple cube and places it in the box. The tile grid (#1–#9) is marked with orange tape; the box sits to the left. The RealSense camera is visible on the stand above.
...and 4 more figures
