Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation

Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Nak Young Chong, Xiem HoangVan

TL;DR

This work proposes a novel ``Human-to-Robot'' imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating.

Abstract

Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel ``Human-to-Robot'' imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to extract actions and identify interacted objects, and (2) Robot Imitation, which employs TD3-based deep reinforcement learning to execute the demonstrated manipulations. We validated our approach in PyBullet simulation environments with a UR5e manipulator and in a real-world experiment with a UF850 manipulator across four fundamental actions: reach, pick, move, and put. For video understanding, our method achieves 89.97% action classification accuracy and BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, representing improvements of 76.4% and 128.4% over the best baseline, respectively. For robot manipulation, our framework achieves an average success rate of 87.5% across all actions, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. The project website is available at https://thanhnguyencanh.github.io/LfD4hri.
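
As a rough reading aid for the reported BLEU-4 gains, the baseline scores they imply can be back-computed. This is only a sketch: it assumes the stated "improvement" is a relative gain over the best baseline, $(s_{\text{ours}} - s_{\text{base}})/s_{\text{base}}$, which the abstract does not state explicitly, and the symbols $s_{\text{ours}}$, $s_{\text{base}}$, and $r$ are introduced here for illustration.

$$
s_{\text{base}} = \frac{s_{\text{ours}}}{1 + r}, \qquad
\frac{0.351}{1 + 0.764} \approx 0.199 \ \text{(standard objects)}, \qquad
\frac{0.265}{1 + 1.284} \approx 0.116 \ \text{(novel objects)}
$$

Under that assumption, the larger percentage gain on novel objects reflects a much lower baseline score rather than a higher absolute score for the proposed method.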

Paper Structure

This paper contains 28 sections, 12 equations, 15 figures, 8 tables, 1 algorithm.

Figures (15)

  • Figure 1: An illustration of Robot Imitation Learning from Human Demonstration.
  • Figure 2: The proposed Video Understanding architecture consists of two parallel branches: (a) the Interacted Object Understanding Module and (b) the Action Understanding Module. Initially, the raw input frames $\mathbf{F} = \{\mathbf{f}_1, \mathbf{f}_2, \dots, \mathbf{f}_n\}$ are downsampled to $\tilde{\mathbf{F}} = \{\tilde{\mathbf{f}}_1, \tilde{\mathbf{f}}_2, \dots, \tilde{\mathbf{f}}_m\}$ (where $n>m$) to reduce runtime and match the frame rate of the training data used by the Action Understanding Module. (a) The Interacted Object Understanding Module processes $\tilde{\mathbf{F}}$ to extract a subset of keyframes $\hat{\mathbf{F}} = \{\hat{\mathbf{f}}_1, \hat{\mathbf{f}}_2, \dots, \hat{\mathbf{f}}_k\}$ (where $m>k$). These frames are analyzed by our Object Selection algorithm and Vision-Language Models (VLMs) to accurately identify the specific objects involved in the interaction. (b) The Action Understanding Module is a CNN with a ResNet-50 backbone (He et al., 2016) and Temporal Shift Modules (TSM), which shift feature channels along the temporal dimension to capture fine-grained motion dynamics for action classification.
  • Figure 3: Pipeline of the proposed Object Selection algorithm. Cropped object regions are tracked across frames to generate trajectory data, which is used to classify objects into Pickable-object and Placeable-object sets based on motion patterns. Each candidate object then undergoes Blur Detection and Overlap Detection to select the highest-quality object instances for subsequent Vision-Language Model processing.
  • Figure 4: Visualization of interacted-object trajectory tracking across three interaction scenarios. Left: RGB frames showing human hand-object interactions. Right: Corresponding 2D trajectory plots of the tracked objects' centroid positions over time, where the characteristic parabolic or oscillatory patterns indicate pickable objects, enabling automatic classification of object functional roles (pickable or placeable).
  • Figure 5: The architecture of our DRL system for robot imitation. The system processes the environment observation $\mathcal{O}$, which consists of the agent's proprioceptive, exteroceptive, relational, and historical data: joint angles $\mathbf{j}_{t}$, joint velocities $\mathbf{j}^\prime_{t}$, object orientation $\mathbf{R}_{o}$, object and target coordinates $\mathbf{p}_{o}$ and $\mathbf{p}_{g}$, end-effector status $\mathbf{p}_{e}$, the relative observations $\bar{\mathbf{p}}_{eo}$ and $\bar{\mathbf{p}}_{og}$, and the past joint angle $\mathbf{a}_{t-1}$. It also takes the input provided by the Video Understanding framework and outputs the location of the object referenced by $\mathbf{i}_t$.
  • ...and 10 more figures
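
The Figure 2 caption describes two operations that are easy to illustrate in isolation: downsampling $n$ input frames to $m$, and the temporal shift that moves feature channels along the time axis. The following is a minimal NumPy sketch, not the authors' implementation. The function names `downsample_frames` and `temporal_shift`, the uniform sampling strategy, and the `fold_div=8` split are assumptions made for illustration; the caption specifies none of them.

```python
import numpy as np


def downsample_frames(frames, m):
    """Pick m frames from the input sequence at evenly spaced indices.

    Assumption: uniform sampling; the source only says n frames are
    reduced to m frames (n > m).
    """
    idx = np.linspace(0, len(frames) - 1, m).round().astype(int)
    return [frames[i] for i in idx]


def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the temporal axis.

    x has shape (T, C, H, W). One 1/fold_div slice of the channels takes
    its values from the next frame, another slice from the previous
    frame, and the remaining channels are left unchanged. Positions with
    no neighbouring frame are zero-filled.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    # Channels [0, fold): frame i receives features from frame i + 1.
    out[:-1, :fold] = x[1:, :fold]
    # Channels [fold, 2*fold): frame i receives features from frame i - 1.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
    # Remaining channels are copied through unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


if __name__ == "__main__":
    frames = list(range(10))                 # stand-in for 10 raw frames
    print(downsample_frames(frames, 4))      # indices of the kept frames
    feats = np.random.rand(4, 16, 7, 7)      # (T, C, H, W) feature maps
    print(temporal_shift(feats).shape)       # shape is preserved
```

The shift itself adds no parameters: mixing information across neighbouring frames is done purely by re-indexing, so a 2D backbone such as ResNet-50 can pick up motion cues without 3D convolutions.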