Task-priority Intermediated Hierarchical Distributed Policies: Reinforcement Learning of Adaptive Multi-robot Cooperative Transport

Yusei Naito; Tomohiko Jimbo; Tadashi Odashima; Takamitsu Matsubara

TL;DR

This work addresses multi-robot cooperative transport when object weights are unknown and the numbers of robots and objects vary, by introducing Task-priority Intermediated Hierarchical Distributed Policies (TIHDP). TIHDP combines a three-layer hierarchy (task allocation, dynamic task priority, and robot control) within a distributed POMDP and is trained centrally with MAPPO for decentralized execution. The dynamic task priority layer coordinates robots across objects via global communication, while the higher and lower layers rely only on local observations, so they remain usable as the counts change. In simulation experiments and real-robot demonstrations, TIHDP outperforms the baselines in transport performance and cooperates robustly, with global communication helping most in larger, more varied scenarios. Target applications include logistics, housekeeping, and disaster response.

Abstract

Multi-robot cooperative transport is crucial in logistics, housekeeping, and disaster response. However, it poses significant challenges in environments where objects of various weights are mixed and the number of robots and objects varies. This paper presents Task-priority Intermediated Hierarchical Distributed Policies (TIHDP), a multi-agent Reinforcement Learning (RL) framework that addresses these challenges through a hierarchical policy structure. TIHDP consists of three layers: task allocation policy (higher layer), dynamic task priority (intermediate layer), and robot control policy (lower layer). Whereas the dynamic task priority layer can manipulate the priority of any object to be transported by receiving global object information and communicating with other robots, the task allocation and robot control policies are restricted by local observations/actions so that they are not affected by changes in the number of objects and robots. Through simulations and real-robot demonstrations, TIHDP shows promising adaptability and performance of the learned multi-robot cooperative transport, even in environments with varying numbers of robots and objects. Video is available at https://youtu.be/Rmhv5ovj0xM
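The three-layer data flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class and function names, the dictionary encoding of priorities, and the hand-written rules are hypothetical stand-ins for the learned policies, and the priority update is not the paper's rule. For brevity, both robots are also given the same object dictionary as their "local" observation. The only point shown is the split between local decisions (higher and lower layers) and a globally shared priority update (intermediate layer).

```python
import math
from dataclasses import dataclass, field


@dataclass
class Robot:
    position: tuple
    # Intermediate-layer state: one priority per object id (hypothetical encoding).
    priorities: dict = field(default_factory=dict)


def allocation_policy(robot, local_objects):
    """Higher-layer stand-in: request the nearest locally observed object id."""
    return min(local_objects,
               key=lambda oid: math.dist(robot.position, local_objects[oid]))


def update_priorities(robots, requests):
    """Intermediate-layer stand-in: every robot raises the priority of each
    requested object. This is the only step that shares information globally."""
    for robot in robots:
        for oid in requests:
            robot.priorities[oid] = robot.priorities.get(oid, 0.0) + 1.0


def control_policy(robot, local_objects):
    """Lower-layer stand-in: unit velocity toward the highest-priority visible object."""
    visible = [oid for oid in local_objects if oid in robot.priorities]
    if not visible:
        return (0.0, 0.0)
    target = max(visible, key=lambda oid: robot.priorities[oid])
    dx = local_objects[target][0] - robot.position[0]
    dy = local_objects[target][1] - robot.position[1]
    norm = math.hypot(dx, dy) or 1.0  # avoid division by zero at the target
    return (dx / norm, dy / norm)


# Two robots on either side of object 0; object 1 is far away.
robots = [Robot((0.0, 0.0)), Robot((2.0, 0.0))]
objects = {0: (1.0, 0.0), 1: (10.0, 0.0)}

requests = [allocation_policy(r, objects) for r in robots]
update_priorities(robots, requests)
commands = [control_policy(r, objects) for r in robots]
```

In this toy run both robots request the same object, so its priority rises for both and each robot is commanded toward it, loosely mirroring the cooperative case in Figure 1(a).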

Paper Structure (28 sections, 9 equations, 6 figures, 3 tables)

INTRODUCTION
RELATED WORK
Combinatorial Optimization
Multi-agent Reinforcement Learning (MARL)
Hierarchical Reinforcement Learning
Problem Formulation
METHOD
Distributed Partially Observable Markov Decision Process
Task-priority Intermediated Hierarchical Distributed Policy Model
Task Allocation Layer
Dynamic Task Priority
Robot Control Layer
Hierarchical Reward Design
Reward for Task Allocation Policy
Reward for Robot Control Policy
...and 13 more sections

Figures (6)

Figure 1: Multi-robot cooperative transport by our method. (a) Robots have high priority on one medium object and cooperate to transport it. (b) Each robot has a high priority for separate objects and transports them independently.
Figure 2: Overview of frameworks. (a) Robots observe globally and select an object. (b) Robots observe locally and select an object from the vicinity. (c) Robots observe locally and update the task priority that possesses global memory.
Figure 3: Task-priority Intermediated Hierarchical Distributed Policy. In robot $i$, the higher- and lower-layer policies use local observations $\bm{o}^{\text{hi}}_i$ and $\bm{o}^{\text{lo}}_i$, respectively. The intermediate layer maintains the task priorities $\bm{\Phi}_i$ through global communication; the priorities change when a request $\sigma_i(\beta_i)$ and a response $d_i(\alpha_i)$ are established. The policy ultimately outputs the robot's control command $\bm{u}_i$.
Figure 4: Simulation environment. Robots, objects, and goals are arranged in a circular pattern from the inside out.
Figure 5: Cumulative rewards of evaluated methods.
...and 1 more figure
