RoboSubtaskNet: Temporal Sub-task Segmentation for Human-to-Robot Skill Transfer in Real-World Environments

Dharmendra Sharma, Archit Sharma, John Reberio, Vaibhav Kesharwani, Peeyush Thakur, Narendra Kumar Dhar, Laxmidhar Behera

TL;DR

RoboSubtaskNet temporally segments long, untrimmed human demonstrations into robot-executable sub-tasks for real-world human-robot collaboration. It combines attention-enhanced I3D features from RGB and optical flow with a Fibonacci-dilated MS-TCN and a transition-aware loss to produce sub-task sequences that map deterministically to manipulator primitives through a DMP-based execution pipeline. A new RoboSubtask dataset of healthcare and industrial demonstrations is introduced to align video understanding with robotic control. The full perception-to-execution system is validated end-to-end on a 7-DoF Kinova Gen3, reaching approximately 91.25% overall task success with practical run-times. The work shows a practical path from fine-grained video understanding to deployable robot manipulation, with strong segmentation performance on benchmarks and robust end-to-end execution in physical trials.
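
The deterministic label-to-primitive mapping mentioned above can be pictured with a minimal Python sketch. It assumes the segmenter emits one label per frame; the `reach`/`pick`/`place` labels follow the reach-pick-place example in the abstract, while the primitive names and the `PRIMITIVES` table are hypothetical placeholders, not the authors' interface. The DMP-based execution itself is not modeled here.

```python
from itertools import groupby

# Hypothetical lookup table: sub-task label -> manipulator primitive name.
# The primitive names are illustrative assumptions, not taken from the paper.
PRIMITIVES = {
    "reach": "move_to_pregrasp",
    "pick": "close_gripper_and_lift",
    "place": "lower_and_open_gripper",
}


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length))
        start += length
    return segments


def segments_to_primitives(segments, table=PRIMITIVES):
    """Map each segment to exactly one primitive; unknown labels raise KeyError."""
    return [table[label] for label, _, _ in segments]


if __name__ == "__main__":
    frames = ["reach"] * 3 + ["pick"] * 2 + ["place"] * 4
    segments = frames_to_segments(frames)
    print(segments)
    print(segments_to_primitives(segments))
```

Because the lookup is a plain dictionary, the same segment sequence always yields the same primitive sequence, and an over-segmented prediction would directly produce spurious extra primitives.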

Abstract

Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success of approximately 91.25%). These results demonstrate a practical path from sub-task-level video understanding to deployed robotic manipulation in real-world settings.
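
To make the Fibonacci dilation schedule concrete, the sketch below compares it with the doubling schedule (1, 2, 4, ...) of the original MS-TCN and computes the receptive field of a stack of kernel-size-3 dilated convolutions. The abstract does not state which Fibonacci terms the network starts from or how many layers it uses, so the starting pair (1, 2) and the 10-layer depth are assumptions made only for illustration.

```python
def fibonacci_dilations(num_layers):
    """Dilations 1, 2, 3, 5, 8, ... (starting pair is an assumption)."""
    dilations = []
    a, b = 1, 2
    for _ in range(num_layers):
        dilations.append(a)
        a, b = b, a + b
    return dilations


def exponential_dilations(num_layers):
    """Dilations 1, 2, 4, 8, ... as in the original MS-TCN."""
    return [2 ** i for i in range(num_layers)]


def receptive_field(dilations, kernel_size=3):
    """Receptive field, in frames, of stacked stride-1 dilated 1D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    fib = fibonacci_dilations(10)
    exp = exponential_dilations(10)
    print(fib, receptive_field(fib))  # [1, 2, 3, 5, 8, 13, 21, 34, 55, 89] 463
    print(exp, receptive_field(exp))  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] 2047
```

Under these assumptions the Fibonacci stack grows more slowly, so more of its layers operate at small temporal offsets, which is consistent with the stated goal of capturing short-horizon transitions.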

Paper Structure

This paper contains 39 sections, 18 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: RoboSubtaskNet pipeline for Human-to-Robot skill transfer.
  • Figure 2: Feature extraction and attention-fusion module. (a) I3D-based feature extractor, (b) Attention fusion schematic.
  • Figure 3: Dilated residual layer with Fibonacci dilation factors.
  • Figure 4: Human demonstrations (top) and the corresponding robot task sequences (bottom, left to right) for each task: (a) pick & place, (b) pick & pour, (c) table cleaning, and (d) pick & give.
  • Figure 5: Qualitative comparison of RoboSubtaskNet with MS-TCN and MS-TCN++. One example video is shown from each of (a) GTEA and (b) Breakfast, and two from RoboSubtask: (c) pick and place and (d) pick and pour. For each example, the video collage is presented along with the ground-truth and predicted segmentations.
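
The ground-truth versus predicted segmentations compared in Figure 5 are what the Edit metric in the abstract summarizes. As a rough illustration, the sketch below computes a segmental edit score in the way it is commonly defined for action segmentation: collapse frame labels into an ordered segment list, take the Levenshtein distance between the two lists, and normalize by the longer one. The paper's exact evaluation code is not shown in this summary, so treat the function names and details as assumptions.

```python
def collapse(frame_labels):
    """Reduce per-frame labels to the ordered list of segment labels."""
    out = []
    for label in frame_labels:
        if not out or out[-1] != label:
            out.append(label)
    return out


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_score(pred_frames, gt_frames):
    """Segmental edit score in [0, 100]; 100 means identical segment order."""
    pred, gt = collapse(pred_frames), collapse(gt_frames)
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(pred, gt)) / longest


if __name__ == "__main__":
    gt = ["reach"] * 3 + ["pick"] * 2 + ["place"] * 3
    # Over-segmented prediction: a spurious "pick"/"reach" flicker near the boundary.
    pred = ["reach"] * 2 + ["pick"] + ["reach"] + ["pick"] * 2 + ["place"] * 2
    print(edit_score(pred, gt))  # 60.0
```

The score ignores segment durations and only penalizes wrong ordering and spurious segments, which is why over-segmentation lowers it even when most frames are labeled correctly.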