RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks

Eleftherios Triantafyllidis; Fernando Acero; Zhaocheng Liu; Zhibin Li

TL;DR

ROMAN addresses long-horizon robotic manipulation with a Hybrid Hierarchical Learning framework that combines Behavioural Cloning, GAIL, and PPO-based reinforcement learning within a Mixture-of-Experts architecture. A central Manipulation Network gates seven specialised experts to compose in-sequence actions, enabling robust failure recovery and adaptability under exteroceptive noise. The approach is validated through extensive simulations and ablations against monolithic networks, showing robustness to vision and sensor noise and the ability to recover from local minima, thereby generalising beyond demonstrations. The work indicates that balancing imitation with intrinsic and extrinsic rewards yields robust, adaptive manipulation capabilities with practical potential for autonomous robotic systems in real-world tasks.

Abstract

Solving long sequential tasks poses a significant challenge in embodied artificial intelligence. Enabling a robotic system to perform diverse sequential tasks with a broad range of manipulation skills is an active area of research. In this work, we present a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN), to address the challenge of solving multiple complex tasks over long time horizons in robotic manipulation. ROMAN achieves task versatility and robust failure recovery by integrating behavioural cloning, imitation learning, and reinforcement learning. It consists of a central manipulation network that coordinates an ensemble of various neural networks, each specialising in distinct re-combinable sub-tasks to generate their correct in-sequence actions for solving complex long-horizon manipulation tasks. Experimental results show that by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for accomplishing long sequences of sophisticated manipulation tasks and achieving adaptive behaviours beyond demonstrations, while exhibiting robustness to various sensory noises. These results demonstrate the significance and versatility of ROMAN's dynamic adaptability featuring autonomous failure recovery capabilities, and highlight its potential for various autonomous manipulation tasks that demand adaptive motor skills.

Paper Structure (11 sections, 6 equations, 9 figures, 17 tables, 1 algorithm)

Real-world Impact of Intelligent Robotics
Imitation Learning and Learning from Demonstration
Hierarchical Learning
Definition of Success Rate:
Expert Networks -- Evaluation against Increasing Levels of Gaussian Noise:
Manipulation Network -- Evaluation against Increasing Levels of Gaussian Noise:
Evaluation of Vision System:
Behavioural Cloning (Warm-Starting the Policy):
GAIL (Commenced after BC and Active Throughout Training):
Reinforcement Learning (Exploration Beyond Imitation):
Integration of BC, GAIL and RL:

Figures (9)

Figure 1: The capabilities of the hierarchical architecture of the ROMAN framework: a Hybrid Hierarchical Learning (HHL) framework for hierarchical task learning, capable of solving significantly long-horizon sequential tasks that require the successful activation and coordination of diverse expert skills, as is commonly necessary in robotics and physics-based interactions. The derivation of high-level specialised experts in ROMAN allowed the construction of a gating network, referred to as the Manipulation Network (MN), which is trained for task-level scene understanding and for the planning and execution of complex, long-horizon sequential tasks through the successful and timely activation of low-level expert networks. A total of seven specialised manipulation skills that are common in daily life were derived; these can be recombined to create higher-level manipulation skills. The specialised skills included in the ROMAN framework are: (i) Pushing a Button, (ii) Pushing, (iii) Picking & Inserting, (iv) Picking & Placing, (v) Rotating-Opening, (vi) Picking & Dropping and (vii) Pulling-Opening. Unlike conventional planning methods or state machines, ROMAN exhibits dynamic adaptability in (i) randomised task sequences, (ii) generalisation outside of demonstrated cases and (iii) recovery from, and robustness against, local minima. The ability of the gating network (MN) to achieve such versatility and robustness is attributed to (i) the HHL architecture at the core of ROMAN and (ii) the high-level decomposition of complex sequences across the experts, which in turn allows the central gating network (MN) to be trained on high-level scene understanding and the orchestration of experts.
The system architecture is based on the Mixture of Experts (MoE) and is able to adapt to environmental demands, overcome various levels of uncertainty and, most importantly, learn complex sequential manipulation tasks with minimal human demonstration.
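To illustrate the gating idea in the Figure 1 caption, the following is a minimal sketch of how a gating network's outputs could weight the actions proposed by several experts. It assumes softmax-normalised gate weights and a weighted sum of expert actions, which is a common Mixture-of-Experts formulation but is not confirmed by the caption as ROMAN's exact blending rule. The names `softmax` and `blend_expert_actions` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw gate outputs."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_expert_actions(
    gate_logits: Sequence[float],
    expert_actions: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (weights, blended_action).

    Assumption: the final action is the weight-averaged expert action.
    Each expert proposes an action vector of the same dimension.
    """
    if len(gate_logits) != len(expert_actions):
        raise ValueError("one gate logit is required per expert")
    weights = softmax(gate_logits)
    dim = len(expert_actions[0])
    blended = [
        sum(w * action[i] for w, action in zip(weights, expert_actions))
        for i in range(dim)
    ]
    return weights, blended


if __name__ == "__main__":
    # Seven experts, as in the caption; the 3-D action vectors are made up.
    logits = [0.1, 0.2, 4.0, 0.0, -1.0, 0.3, 0.1]
    actions = [[float(k), 0.0, -float(k)] for k in range(7)]
    weights, action = blend_expert_actions(logits, actions)
    print("weights:", [round(w, 3) for w in weights])
    print("blended action:", [round(a, 3) for a in action])
```

With one dominant logit, the blended action approaches that expert's own action, which corresponds to the gating network "activating" a single expert at a given point in the sequence.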
Figure 1: The flow chart of the hybrid training procedure, depicting its main stages. Demonstrations are first used to warm-start the policy via behavioural cloning (BC). Thereafter, the policy is updated with PPO, which acts as the general-purpose update rule, using extrinsic ($r_E$) and intrinsic ($r_I$) rewards provided by the environment and GAIL's discriminator respectively. This training procedure is employed for all expert NNs incorporated in ROMAN's hierarchical framework. Given the pre-trained expert NNs, the MN is subsequently trained with the same hybrid learning procedure.
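The reward combination described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the intrinsic reward uses one commonly used GAIL form, $r_I = -\log(1 - D(s, a))$, where $D$ is the discriminator's estimated probability that a state-action pair came from the demonstrations, and the two signals are mixed by a weighted sum. The weights `w_extrinsic` and `w_intrinsic` and both function names are hypothetical.

```python
import math


def gail_intrinsic_reward(d_prob: float, eps: float = 1e-8) -> float:
    """Intrinsic reward r_I = -log(1 - D(s, a)).

    d_prob is the discriminator output in [0, 1]; higher means the sample
    looks more like a demonstration. eps keeps the logarithm finite at D = 1.
    """
    d = min(max(d_prob, 0.0), 1.0)
    return -math.log(max(1.0 - d, eps))


def combined_reward(
    r_extrinsic: float,
    d_prob: float,
    w_extrinsic: float = 1.0,  # hypothetical weight on the environment reward r_E
    w_intrinsic: float = 0.5,  # hypothetical weight on the GAIL reward r_I
) -> float:
    """Weighted sum of extrinsic and intrinsic rewards (an assumed mixing rule)."""
    return w_extrinsic * r_extrinsic + w_intrinsic * gail_intrinsic_reward(d_prob)


if __name__ == "__main__":
    # The more demonstration-like the behaviour, the larger the intrinsic bonus.
    for d in (0.1, 0.5, 0.9):
        print(d, round(combined_reward(r_extrinsic=1.0, d_prob=d), 4))
```

The resulting scalar would then feed the PPO update in place of the raw environment reward; lowering `w_intrinsic` shifts the policy from imitation towards exploration driven by the task reward.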
Figure 2: ROMAN's ability to adapt to scenarios beyond the demonstrated sequence and to exhibit behaviour beyond imitation, most notably dynamic recovery, by virtue of balancing exploitation and exploration via the employed HHL approach. Figures (A) and (B): Policy adaptation of ROMAN during failures in the Picking & Placing and Picking & Dropping sub-tasks respectively. These intermediate failures are attributed either to individual expert error or to gating network error. For these rare instances, we show the error cases ($t=1$) of these experts, which quickly and dynamically re-adapt and re-grasp the items ($t=2$ to $t=4$) to successfully complete the sequence and, more broadly, the end goal. Figure (C): The ability of the MN of the ROMAN framework to dynamically adapt in cases that were not encountered in the demonstrated sequence but were visited during RL training as a result of balancing exploitation and exploration in the employed hybrid learning procedure. This balance ultimately resulted in new behaviours beyond imitation, leading to recovery capabilities from local minima. The figure shows 12 snapshots over time, ordered from left to right and top to bottom, highlighting the weight assignments by the MN.
Figure 2: Training plot of each individual expert, depicting the normalised reward over environment steps (in millions). The figure shows the returns of each expert in the ROMAN framework over the duration of its training. Note that the number of environment steps required depends on the nature and complexity of each expert's specialisation. The most apparent observation is that the experts with higher-complexity goals, such as those concerned with Picking & Dropping, Placing or Inserting, were the most complex and had the longest time horizons compared to the other experts. As discussed in detail in the main manuscript, developing task-specific experts reduced the subsequent burden on the primary gating network: rather than learning to schedule low-level sub-tasks, the gating network can focus entirely on orchestrating the higher-level tasks using specialised experts. This approach minimises the amount of unnecessary information that the gating network needs to process during the sequential supervision and orchestration of the included experts, ultimately resulting in more efficient and effective task execution. As observed in the reward plot, the Picking & Inserting expert presented the highest complexity, requiring the most environment steps of training compared to the other experts. This is further evidenced by the qualitative difficulty of obtaining the demonstration data from a human expert, which was also most demanding in terms of effort for this specific sub-task. All experts depicted and used in ROMAN were pre-trained with $N=20$ demonstrations.
Figure 3: Analysis of the MN observations using t-Distributed Stochastic Neighbour Embedding (t-SNE), with visualised snapshots showing ROMAN's completion of sequential tasks in 2D as well as 3D space. t-SNE projects the 29-dimensional MN state vector into 2 dimensions. Principal Component Analysis (PCA) was used to warm-start the t-SNE projection. Figure (A): The state vectors at the start of each of the seven case scenarios, sampled at 1000 Hz for 1 s. A total of 1000 samples were projected with a perplexity of 400. Figure (B): The state vectors during the sequence of actions contained in each case scenario, sampled for the first 1.5 s of each expert sequence. Because these are sampled within the sequence of actions, they appear "trajectory"-like, since the robot and the objects it manipulates are already in motion during sampling. A total of 1500 samples were projected with a perplexity of 200. Six out of the seven scenario cases are depicted; the S1 case includes only a single expert activation and is therefore omitted from the analysis. Figure (C): ROMAN in its initial 2D stage, depicting the five distinct sub-tasks, each managed by its respective expert. Figure (D): ROMAN in its final stage, in the most complex setting with the longest time-horizon sequential tasks.
...and 4 more figures
