Exploiting Information Theory for Intuitive Robot Programming of Manual Activities

Elena Merlo, Marta Lagomarsino, Edoardo Lamon, Arash Ajoudani

TL;DR

This work presents a one-shot framework for programming robots from RGB videos by extracting high-level task structure with Shannon's Information Theory. It builds frame-wise scene graphs from 6D hand and object poses, uses entropy and mutual information to identify active interactions, and segments the demonstration into Interaction Units that map to motion primitives. These primitives are composed into Behavior Trees to generate robot execution plans that generalize to new object poses and environments. The approach achieves a 92% success rate on a multi-subject dataset and demonstrates plan generalization on a Franka Panda robot, highlighting practical benefits for intuitive, data-efficient robot programming. The HANDSOME dataset is released to promote further research and benchmarking in semantic scene understanding for manipulation.

Abstract

Observational learning is a promising approach to enable people without programming expertise to transfer skills to robots in a user-friendly manner, since it mirrors how humans learn new behaviors by observing others. Many existing methods focus on instructing robots to mimic human trajectories, but motion-level strategies often struggle to generalize skills across diverse environments. This paper proposes a novel framework that allows robots to achieve a higher-level understanding of human-demonstrated manual tasks recorded in RGB videos. By recognizing the task structure and goals, robots can generalize what they observe to unseen scenarios. We base our task representation on Shannon's Information Theory (IT), which is applied for the first time to manual tasks. IT helps extract the active scene elements and quantify the information shared between hands and objects. We exploit scene graph properties to encode the extracted interaction features in a compact structure and segment the demonstration into blocks, streamlining the generation of Behavior Trees for robot replicas. Experiments validated the effectiveness of IT in automatically generating robot execution plans from a single human demonstration. Additionally, we provide HANDSOME, an open-source dataset of HAND Skills demOnstrated by Multi-subjEcts, to promote further research and evaluation in this field.

Paper Structure

This paper contains 27 sections, 5 equations, 20 figures, 2 tables, 2 algorithms.

Figures (20)

  • Figure 1: The overall structure of the proposed framework is composed of two main blocks: (i) scene representation, and (ii) automatic mapping to robot instructions.
  • Figure 2: (left) The scenario of an object being relocated is presented. $X(t)$ represents its 1D position signal over time. $H(X(t))$ is computed by shifting a window $w$ over $X(t)$ and computing the entropy at each time point from the samples that fall inside $w$. The bell-shaped curve flags the position variation. (right) The case where a hand moves an object is depicted. The 1D position signal of the hand is denoted as $Y(t)$ and that of the object as $X(t)$. By computing $MI(X:Y)$ at each time point while shifting the window $w$, a bell-shaped curve is obtained corresponding to when the hand and the object move together.
  • Figure 3: Detection of Hand-Object interactions ($HO$), categorized into manipulation and contact-only $HO$.
  • Figure 4: Detection of dynamic Object-Object interactions ($OO$). If conditions are satisfied, the manipulated object $o_m$ and a generic object $o_j$ within the scene are considered a manipulated unity $u_m$.
  • Figure 5: Detection of static Object-Object interactions ($OO$). According to the trend of the entropy of objects' average distance $H(\overline{d}_{o_m, o_j})$, $OO$ can be either significant or temporary.
  • ...and 15 more figures
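
The sliding-window computation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the histogram-based estimator, the bin count (`bins=8`), the window length (`w=41`), and the synthetic 1D signals are all assumptions chosen for the example, and the function names are hypothetical. It uses the standard definitions $H(X) = -\sum_x p(x)\log_2 p(x)$ and $MI(X:Y) = H(X) + H(Y) - H(X,Y)$.

```python
import numpy as np


def _entropy_bits(p):
    """Shannon entropy in bits of a probability array (zero entries skipped)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def windowed_entropy(x, w, bins=8):
    """Entropy of x inside a window of length w centred on each sample."""
    x = np.asarray(x, dtype=float)
    value_range = (x.min(), x.max())  # fixed bins across all windows
    half = w // 2
    out = np.zeros(len(x))
    for t in range(len(x)):
        lo, hi = max(0, t - half), min(len(x), t + half + 1)
        counts, _ = np.histogram(x[lo:hi], bins=bins, range=value_range)
        out[t] = _entropy_bits(counts / counts.sum())
    return out


def windowed_mutual_information(x, y, w, bins=8):
    """MI(X:Y) = H(X) + H(Y) - H(X,Y) inside a sliding window of length w."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ranges = [(x.min(), x.max()), (y.min(), y.max())]
    half = w // 2
    out = np.zeros(len(x))
    for t in range(len(x)):
        lo, hi = max(0, t - half), min(len(x), t + half + 1)
        joint, _, _ = np.histogram2d(x[lo:hi], y[lo:hi], bins=bins, range=ranges)
        pxy = joint / joint.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        out[t] = _entropy_bits(px) + _entropy_bits(py) - _entropy_bits(pxy.ravel())
    return out


# Synthetic demonstration (assumed signals, not data from the paper):
#   samples   0-99 : hand approaches, object is still
#   samples 100-199: hand and object move together
#   samples 200-299: both are still
obj = np.concatenate([np.zeros(100), np.linspace(0, 1, 100), np.ones(100)])
hand = np.concatenate([np.linspace(-0.5, 0, 100), np.linspace(0, 1, 100), np.ones(100)])

w = 41
hx = windowed_entropy(obj, w)    # rises only while the object moves
hy = windowed_entropy(hand, w)   # rises whenever the hand moves
mi = windowed_mutual_information(obj, hand, w)  # rises only during joint motion
```

In this toy setup the hand's entropy is non-zero during the approach phase while the mutual information stays at zero, because the object's position carries no information there; the mutual information rises only once the two signals vary together, which is the cue the caption associates with a hand moving an object.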