EasyMimic: A Low-Cost Framework for Robot Imitation Learning from Human Videos

Tao Zhang; Song Xia; Ye Wang; Qin Jin

TL;DR

EasyMimic addresses the data bottleneck in imitation learning for home robots by learning manipulation policies from videos recorded with consumer RGB cameras. It bridges the human-robot embodiment gap with action-space retargeting and lightweight visual augmentation, then co-trains a diffusion-transformer-based policy on abundant human demonstrations and a small amount of robot data. The system runs on a low-cost LeRobot platform in a setup costing under $300, uses HaMeR for 3D hand reconstruction, and anchors retargeting at the center of the thenar eminence. It achieves high task success on four tabletop tasks and under language-conditioned instructions, with consistent gains over robot-only baselines, pointing to a scalable, user-friendly route to intelligent household robots.

Abstract

Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes. Project website: https://zt375356.github.io/EasyMimic-Project/.
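
The action alignment step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 21-keypoint index layout, the three-joint approximation of the anchor near the base of the thumb, the use of thumb-index fingertip distance as the gripper signal, and the `max_open_m` scale are all assumptions made for illustration.

```python
import math

# Assumed 21-keypoint layout (wrist, then four joints per finger); the indices
# below are illustrative and should be checked against the hand model in use.
WRIST, THUMB_CMC, THUMB_MCP, THUMB_TIP, INDEX_TIP = 0, 1, 2, 4, 8


def retarget_frame(keypoints, max_open_m=0.10):
    """Map one frame of 3D hand keypoints to (anchor_xyz, gripper_open).

    keypoints: list of 21 (x, y, z) tuples in metres.
    max_open_m: hypothetical thumb-index distance treated as fully open.
    """
    # Approximate anchor near the base of the thumb: mean of three joints.
    base = [keypoints[i] for i in (WRIST, THUMB_CMC, THUMB_MCP)]
    anchor = tuple(sum(p[axis] for p in base) / len(base) for axis in range(3))

    # Gripper opening from thumb-index fingertip distance, clipped to [0, 1].
    pinch = math.dist(keypoints[THUMB_TIP], keypoints[INDEX_TIP])
    gripper_open = min(max(pinch / max_open_m, 0.0), 1.0)
    return anchor, gripper_open
```

Applied frame by frame to a reconstructed hand trajectory, a function of this shape would yield a sequence of end-effector positions and gripper commands. Orientation, smoothing, and the mapping into the robot's own coordinate frame are omitted here.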

Paper Structure (24 sections, 1 equation, 6 figures, 7 tables)

Introduction
Related Work
Robot Data Collection
Learning from Human Videos
Vision Language Action Model
Method
Data Collection Systems and Hardware Design
Physical Alignment
Action Space Alignment
Visual Space Alignment
Training Strategy
Experiments
Experimental Setup
Main Results
Comparison of Training Strategies
...and 9 more sections

Figures (6)

Figure 1: Overview of the EasyMimic framework. The framework learns robotic manipulation from human videos captured with low-cost hardware. To bridge the embodiment gap, it aligns the action and visual spaces via physical alignment. A VLA model is then fine-tuned on the combined data for rapid adaptation to new tasks.
Figure 2: Physical alignment process. Human hand keypoints and meshes are extracted from videos. Hand motion is retargeted to robot actions via the action space alignment module, while the hand mesh is augmented through the visual space alignment module to bridge the physical gap between humans and robots.
Figure 3: Co-training strategy. Human demonstration data and robot teleoperation data are mixed during training. A shared DiT module learns a unified policy representation, while separate action encoders and decoders for each embodiment handle their specific data properties.
Figure 4: Effect of Dataset Size. (a) Varying human data with fixed robot data (10 trajectories). (b) Varying robot data with fixed human data (50 videos).
Figure 5: Case analysis of failure modes across different tasks. (a) Premature gripper release during pick and place. (b) Imprecise handle grasping in drawer manipulation. (c) Collision-induced object falling during stacking. (d) Unstable placement leading to the object falling.
...and 1 more figure
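
The data mixing described in the Figure 3 caption can be sketched as a batch sampler that draws from both sources and tags each sample with its embodiment, so that a downstream model could route it to the matching action encoder and decoder. This is a minimal sketch under stated assumptions: the fixed per-batch `human_ratio`, the function name `mixed_batches`, and sampling with replacement are hypothetical choices, not details confirmed by the paper.

```python
import random


def mixed_batches(human, robot, batch_size=8, human_ratio=0.5, steps=3, seed=0):
    """Yield batches of (embodiment_tag, sample) drawn from both datasets.

    human, robot: non-empty lists of training samples.
    human_ratio: assumed fraction of each batch taken from human data.
    """
    rng = random.Random(seed)
    n_human = round(batch_size * human_ratio)
    for _ in range(steps):
        # Sample with replacement so the smaller robot set can be reused.
        batch = [("human", rng.choice(human)) for _ in range(n_human)]
        batch += [("robot", rng.choice(robot)) for _ in range(batch_size - n_human)]
        rng.shuffle(batch)
        yield batch
```

With 50 human videos and 10 robot trajectories, as in the Figure 4 setting, a sampler like this keeps robot data present in every batch even though it is the smaller set.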
