Bi-CL: A Reinforcement Learning Framework for Robots Coordination Through Bi-level Optimization

Zechen Hu; Daigo Shishika; Xuesu Xiao; Xuan Wang

TL;DR

This work tackles coordinated multi-robot learning under local observations by formulating a bi-level Dec-MDP and proposing Bi-CL, which couples a reduced-action RL upper level with a lower level whose optimal action $y^*$ is derived from a global optimizer. An imitation-learning lower-level policy $\eta^i$ and an alignment penalty $c_k\sum_i \mathcal{H}_i$ bridge the information gap under centralized training and decentralized execution (CTDE), improving training efficiency and stability. Across route- and graph-based tasks, Bi-CL reaches final performance comparable to MARL baselines while converging faster, highlighting the value of action-space reduction and hierarchical optimization for scalable robot coordination. The approach supports practical coordination by enabling efficient learning with incomplete information and hybrid action spaces in multi-robot systems.

Abstract

In multi-robot systems, achieving coordinated missions remains a significant challenge due to the coupled nature of coordination behaviors and the lack of global information for individual robots. To mitigate these challenges, this paper introduces a novel approach, Bi-level Coordination Learning (Bi-CL), that leverages a bi-level optimization structure within a centralized training and decentralized execution paradigm. Our bi-level reformulation decomposes the original problem into a reinforcement learning level with a reduced action space and an imitation learning level that obtains demonstrations from a global optimizer. Both levels contribute to improved learning efficiency and scalability. We note that the robots' incomplete information leads to mismatches between the two levels of learning models. To address this, Bi-CL further integrates an alignment penalty mechanism that aims to minimize the discrepancy between the two levels without degrading their training efficiency. We introduce a running example to conceptualize the problem formulation and apply Bi-CL to two variations of this example: route-based and graph-based scenarios. Simulation results demonstrate that Bi-CL learns more efficiently than traditional multi-agent reinforcement learning baselines for multi-robot coordination while achieving comparable performance.
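The bi-level decomposition described in the abstract can be illustrated with a small toy problem. The sketch below is not the paper's algorithm: it replaces both learning levels with brute-force search, minimizes a cost instead of maximizing a reward, and uses a 0/1 mismatch as a stand-in for $\mathcal{H}_i$. All names (`team_cost`, `lower_level_optimum`, `upper_level_optimum`, `alignment_penalty`) and the toy positions are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy setup: two robots each choose a position (upper-level
# action x) and an adversary to guard (lower-level action y).
POSITIONS = [0, 1, 2]
ADVERSARIES = [0, 1]
ADVERSARY_POS = {0: 0, 1: 2}  # assumed adversary locations


def team_cost(x, y):
    """Coupled team cost: distance to the guarded adversary plus a
    penalty of 3 for every adversary left unguarded (assumed form)."""
    distance = sum(abs(xi - ADVERSARY_POS[yi]) for xi, yi in zip(x, y))
    unguarded = len(set(ADVERSARIES) - set(y))
    return distance + 3 * unguarded


def lower_level_optimum(x):
    """Stand-in for the global optimizer: best guard assignment y*(x)
    for fixed moves x, found by exhaustive search."""
    candidates = product(ADVERSARIES, repeat=len(x))
    return min(candidates, key=lambda y: team_cost(x, y))


def upper_level_optimum():
    """Upper level searches over x only, scoring each x at y*(x).
    This is the action-space reduction: 9 candidates instead of 36."""
    candidates = product(POSITIONS, repeat=2)
    return min(candidates, key=lambda x: team_cost(x, lower_level_optimum(x)))


def alignment_penalty(y_local, y_star, c_k=1.0):
    """c_k * sum_i H_i, with H_i assumed here to be a 0/1 indicator of
    disagreement between a robot's local choice and the optimizer's."""
    return c_k * sum(int(a != b) for a, b in zip(y_local, y_star))


if __name__ == "__main__":
    x_star = upper_level_optimum()
    y_star = lower_level_optimum(x_star)
    print("moves:", x_star, "guards:", y_star, "cost:", team_cost(x_star, y_star))
    # A local policy that guards adversary 1 with both robots disagrees
    # with y* on one robot, so it incurs a nonzero alignment penalty.
    print("penalty:", alignment_penalty((1, 1), y_star, c_k=2.0))
```

In this toy, the upper level never enumerates guard actions itself; it relies on the lower level to supply them, which mirrors the reduced action space the abstract attributes to the reinforcement learning level. The penalty term only shows where a mismatch between a locally informed policy and the global optimizer would be measured.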

Paper Structure (12 sections, 12 equations, 7 figures, 2 tables)

Introduction
Literature Review
MARL for Multi-Robot Coordination
Bi-level Optimization
Preliminaries and Problem Formulation
Formulation of a Bi-level Optimization
Bi-level Formulation for Multi-robot Coordination Learning with Local Robot Observation
Main Approach
Numerical Results
Coordinated Multi-robot Route Traversal
Coordinated Multi-robot Graph Traversal
Conclusion

Figures (7)

Figure 1: A running example for a firefighting scenario. Robots can simultaneously perform two actions: move (to where) and guard (which adversary). Team reward depends on the risk of fire, which is a coupled function of both actions.
Figure 2: A Centralized Bi-level Optimization for RL.
Figure 3: A Bi-level Coordination Learning (Bi-CL) Algorithm: incorporating multi-agent reinforcement learning (MARL) and imitation learning (MAIL), guided by a global optimizer.
Figure 4: Running Example (a): all robots travel along a route.
Figure 5: Comparison of cumulative reward for different alignment penalties with four robots and four adversaries. The height of the red dashed lines indicates implementation performance.
...and 2 more figures

Theorems & Definitions (2)

Remark 1
Remark 2
