Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation

Archit Sharma, Dharmendra Sharma, John Rebeiro, Peeyush Thakur, Narendra Dhar, Laxmidhar Behera

TL;DR

This work tackles the problem of following free-form natural-language commands on resource-constrained robots using a fully on-device two-stage pipeline. It introduces Instruct2Act, a compact BiLSTM with a multi-head attention autoencoder to map instructions to ordered sub-action sequences, and RAN, which uses DATRN for trajectory learning complemented by a vision-grounded environment analyzer to execute actions via a PD-controlled, on-device controller. The approach achieves high sub-action prediction accuracy ($91.5\%$) and solid end-to-end success across four manipulation tasks ($90\%$ overall), with sub-action inference under $3.8$ s and end-to-end execution in $30$–$60$ s depending on task complexity. The results demonstrate practical, real-time manipulation in single-camera, resource-limited settings and highlight a path toward robust, offline, real-world robotic automation, while noting challenges with long-horizon instructions and proposing dataset expansion and lightweight transformer alternatives as future work.

Abstract

Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire pipeline runs on modest hardware with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield an overall success rate of 90%; sub-action inference completes in under 3.8 s, and end-to-end execution takes 30–60 s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
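The two-stage flow described above can be illustrated with a minimal Python sketch. It only mirrors the control flow (parse an instruction into ordered sub-actions, check that the target object is visible, then execute each sub-action in order) and is not the authors' implementation. All names here (`TaskPlan`, `parse_instruction`, `run_pipeline`, `KNOWN_OBJECTS`) are hypothetical. The keyword matcher is a placeholder for the learned BiLSTM parser, the set of visible objects is a placeholder for the YOLOv8-based analyzer, and the callback is a placeholder for DATRN-based execution. The fixed reach/grasp/move/place sequence is an assumption taken from the example actions listed in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

# Hypothetical object vocabulary; the paper's actual object set is not given here.
KNOWN_OBJECTS: Set[str] = {"cup", "bottle", "tray"}


@dataclass(frozen=True)
class TaskPlan:
    """Ordered sub-actions plus the object they act on."""
    sub_actions: List[str]
    target_object: str


def parse_instruction(instruction: str) -> TaskPlan:
    """Placeholder for the learned parser: picks the first known object
    mentioned and returns a fixed, assumed sub-action sequence."""
    words = instruction.lower().replace(".", "").replace(",", "").split()
    target = next((w for w in words if w in KNOWN_OBJECTS), "")
    return TaskPlan(sub_actions=["reach", "grasp", "move", "place"],
                    target_object=target)


def run_pipeline(instruction: str,
                 visible_objects: Set[str],
                 execute_sub_action: Callable[[str, str], None]) -> List[str]:
    """Run the plan only if the target object is visible.

    Returns the list of sub-actions that were executed, or an empty list
    if the target was not found in the scene."""
    plan = parse_instruction(instruction)
    if plan.target_object not in visible_objects:
        return []  # target missing: nothing is executed
    executed: List[str] = []
    for action in plan.sub_actions:
        execute_sub_action(action, plan.target_object)
        executed.append(action)
    return executed


if __name__ == "__main__":
    done = run_pipeline("Pick the cup and place it on the tray.",
                        {"cup", "tray"},
                        lambda action, obj: print(f"{action} -> {obj}"))
    print(done)
```

Keeping the parser, the scene check, and the executor behind separate functions reflects the modular split between Instruct2Act and RAN, so each placeholder could be swapped for a learned component without changing the surrounding loop.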

Paper Structure (35 sections, 8 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview of Instruct2Act and RAN.
  • Figure 2: Overall framework: The Instruct2Act provides a task plan, i.e., a sequence of identified sub-actions and the objects from the user's input. The environment analyzer then checks for the target object; if available, the robot execution model executes the sub-actions.
  • Figure 3: Workflow of RAN.
  • Figure 4: Training and reconstruction loss curves for the model.
  • Figure 5: Confusion matrix of Instruct2Act on test set.
  • ...and 3 more figures