Toward Scalable Multirobot Control: Fast Policy Learning in Distributed MPC

Xinglong Zhang; Wei Pan; Cong Li; Xin Xu; Xiangke Wang; Ronghua Zhang; Dewen Hu

TL;DR

The paper tackles the challenge of scalable, real-time cooperative control for large-scale multirobot systems by replacing computationally intensive online nonlinear programming (NLP) solvers with a distributed policy-learning approach that yields explicit closed-loop DMPC policies. It introduces DLPC, a distributed online actor-critic framework that updates policies forward in time within each prediction interval, and it guarantees stability through a practical Lyapunov-based condition and safety through a force-field-inspired barrier design. The approach scales well: online policy learning runs for up to $10{,}000$ robots with computational load that grows linearly, and the learned policies transfer from simulation to real drone and wheeled-robot experiments. These results are a step toward fast, scalable, and safe optimization-based control for large multirobot systems, with model-free extensions and time-varying networks as promising future directions.

Abstract

Distributed model predictive control (DMPC) is a promising approach to achieving optimal cooperative control in multirobot systems (MRS). However, real-time DMPC implementation relies on numerical optimization tools to periodically calculate local control sequences online. This process is computationally demanding and lacks scalability for large-scale, nonlinear MRS. This article proposes a novel distributed learning-based predictive control (DLPC) framework for scalable multirobot control. Unlike conventional DMPC methods that calculate open-loop control sequences, our approach centers on a computationally fast and efficient distributed policy learning algorithm that generates explicit closed-loop DMPC policies for MRS without using numerical solvers. The policy learning is executed incrementally and forward in time in each prediction interval through an online distributed actor-critic implementation. The control policies are successively updated in a receding-horizon manner, enabling fast and efficient policy learning with a closed-loop stability guarantee. The learned control policies can be deployed online to MRS with varying robot scales, enhancing scalability and transferability for large-scale MRS. Furthermore, we extend our methodology to address the multirobot safe learning challenge through a force-field-inspired policy learning approach. We validate the effectiveness, scalability, and efficiency of our approach through extensive experiments on cooperative tasks of large-scale wheeled robots and multirotor drones. Our results demonstrate rapid learning and deployment of DMPC policies for MRS with scales of up to 10,000 units.
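To make the actor-critic idea in the abstract more concrete, the following is a minimal sketch under heavy simplifying assumptions: a single robot with scalar linear error dynamics $e(\tau+1)=a\,e(\tau)+b\,u(\tau)$, stage cost $q e^2 + r u^2$, a linear actor $u=-Ke$, and a quadratic critic $J(e)=Pe^2$. It is not the DLPC algorithm, which is distributed, handles nonlinear MRS, and uses general parameterized approximators; the function names are hypothetical. It only illustrates three ingredients named above: critic and actor updates inside each prediction interval, weights warm-started into the next interval, and only the first control applied before the window shifts.

```python
"""Toy receding-horizon actor-critic on a scalar linear system.

Hypothetical simplification, not the DLPC algorithm: one robot,
e(t+1) = a*e(t) + b*u(t), stage cost q*e^2 + r*u^2,
linear actor u = -K*e, quadratic critic J(e) = P*e^2.
"""


def critic_update(P, K, a, b, q, r):
    # Critic: Bellman backup of the current policy's cost-to-go,
    # P*e^2 = q*e^2 + r*(K*e)^2 + P*((a - b*K)*e)^2.
    closed_loop = a - b * K
    return q + r * K**2 + P * closed_loop**2


def actor_update(P, a, b, r):
    # Actor: gain that minimizes r*u^2 + P*(a*e + b*u)^2 over u = -K*e.
    return b * P * a / (r + b**2 * P)


def receding_horizon_learning(e0, a, b, q, r, N=10, steps=30):
    P, K = 0.0, 0.0  # weights are warm-started from one interval to the next
    e = e0
    for _ in range(steps):
        # N critic/actor updates within the prediction interval [k, k+N-1].
        # In this linear-quadratic toy case the updates do not depend on the
        # predicted states, so no state samples are needed.
        for _ in range(N):
            P = critic_update(P, K, a, b, q, r)
            K = actor_update(P, a, b, r)
        # Apply only the first control of the interval, then shift the window.
        u = -K * e
        e = a * e + b * u
    return P, K, e
```

For example, `receding_horizon_learning(1.0, 1.0, 1.0, 1.0, 1.0)` drives $P$ and $K$ to the discrete-time Riccati solution $P=(1+\sqrt{5})/2$, $K=(\sqrt{5}-1)/2$, while the error decays toward zero.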

Paper Structure (29 sections, 10 theorems, 100 equations, 19 figures, 4 tables, 2 algorithms)

This paper contains 29 sections, 10 theorems, 100 equations, 19 figures, 4 tables, 2 algorithms.

Introduction
Related Work
Control Problem Formulation
Dynamical Models of MRS
Distributed MPC for MRS
Fast Policy Learning Framework for DMPC
Policy Learning Design for DMPC
Distributed Online Actor-Critic Learning Implementation
Practical Stability Verification Condition
Safe Policy Learning
Force field-inspired Policy Learning Design
Distributed Safe Actor-Critic Learning Implementation
Simulation and Experimental Results
Simulated Experiments on MRS
Policy Deployment to Multirotor Drones in Gazebo
...and 14 more sections

Key Result

Theorem 1

Let $\bm u^0(k)$ be an initial policy and let the initial value function satisfy $J^0(e(\tau))\geq r(e(\tau),u^0(\tau))+ J^{0}(e(\tau+1))$ for all $\tau\in[k,k+N-1]$. Then, under the iteration (Eqn:safempc-o), it holds that
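The hypothesis of Theorem 1 is a Lyapunov-like decrease condition on $J^0$ along the trajectory generated by the initial policy. The sketch below checks only that hypothesis (not the theorem's conclusion) on a sampled trajectory. The scalar model, the policy $u^0=-0.5e$, and the candidate functions $J^0$ are hypothetical choices for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def satisfies_initial_condition(J0: Callable[[float], float],
                                r: Callable[[float, float], float],
                                e_traj: Sequence[float],
                                u_traj: Sequence[float],
                                tol: float = 1e-12) -> bool:
    """Check J0(e(tau)) >= r(e(tau), u0(tau)) + J0(e(tau+1)) for every tau.

    e_traj holds e(k), ..., e(k+N); u_traj holds u0(k), ..., u0(k+N-1).
    """
    if len(e_traj) != len(u_traj) + 1:
        raise ValueError("e_traj must have exactly one more entry than u_traj")
    return all(
        J0(e_traj[t]) + tol >= r(e_traj[t], u_traj[t]) + J0(e_traj[t + 1])
        for t in range(len(u_traj))
    )


def rollout(e0: float, policy: Callable[[float], float],
            a: float, b: float, N: int) -> Tuple[List[float], List[float]]:
    # Hypothetical scalar model e(tau+1) = a*e(tau) + b*u(tau) over N steps.
    e, u = [e0], []
    for _ in range(N):
        u.append(policy(e[-1]))
        e.append(a * e[-1] + b * u[-1])
    return e, u
```

With $a=b=1$, $u^0=-0.5e$, and $r(e,u)=e^2+u^2$, each step gives $r+J^0(e^+)=1.75e^2$ for $J^0(e)=2e^2$, so the condition holds; for $J^0(e)=e^2$ the right-hand side is $1.5e^2>e^2$, so it fails.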

Figures (19)

Figure 1: The motivational problem. A: In nonlinear DMPC, the optimization problems are usually solved with nonlinear programming (NLP) solvers, which are computationally intensive and do not scale well, especially for nonlinear MRS with many robots. B: Our approach generates closed-loop DMPC policies for MRS through distributed policy learning; the learned policies are parameterized functions that can be trained and deployed online at robot scales of up to 10,000.
Figure 2: An example communication graph with $M=6$. The arrows indicate the directions of information exchange among robots. For robot 1, the set of its neighbors (including itself) is $\mathcal{N}_1=\{1,2,5\}$, while the set of robots that include robot 1 among their neighbors is $\bar{\mathcal{N}}_1=\{1,6\}$. Neighboring robots exchange information instantaneously at each step.
Figure 3: A: A diagram of the distributed actor-critic learning algorithm in the prediction interval $[k,k+N-1]$ for the formation control of wheeled vehicles or multirotor drones. The definitions of $\lambda_i^d$ and $u_{o,i}^d$ are given in (Eqn:lam_d-o) and (Eqn:act-d-o). B: The learned control policy has an explicit structure, and the policy learned with 2 robots can be deployed online to 1,000 robots via weight sharing (see Section sec:simu for implementation details).
Figure 4: A: An example of the practical stability verification condition. The purple line shows the cost under the distributed actor-critic learning algorithm, which may not decrease monotonically but is bounded by two monotonically decreasing cost values $J^b(k)$ obtained under two baseline stabilizing control policies. B: An example of the relaxed barrier function for $\mathcal{B}^o_{z,i}(z_i)=-\log(b-z_i)-\log(b+z_i)$, where the black dotted line represents $\delta_i(z_i,\bar{\sigma}_i)$ in (Eqn:relaxed_B), and the blue line represents the recentered transformation $\mathcal{B}_{z,i}^c(z_i)=\mathcal{B}^o_{z,i}(z_i)+2\log b$ centered at $z_{c,i}=0$.
Figure 5: A: Online policy learning with robot scales up to 10,000, where $r_i(k)=\| e_{ \mathcal{N}_i}(k)\|_{Q_i}^{2}+\| u_i(k)\|_{R_i}^{2}$. B: Performance when the policy learned for a straight-line formation of 2 robots is transferred to a circular formation of 2 robots and to different formation scenarios with 4, 200, and 1,000 robots. Note that "two robots" refers to two follower robots plus a leader. The leader in this work is a virtual entity and is not counted in the total number of robots.
...and 14 more figures
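The neighbor sets of Figure 2 and the local stage cost of Figure 5 fit together: robot $i$'s cost depends only on errors in $\mathcal{N}_i$, and $\bar{\mathcal{N}}_j=\{i : j\in\mathcal{N}_i\}$ collects the robots whose costs depend on robot $j$. The sketch below assumes (hypothetically) that $e_{\mathcal{N}_i}$ stacks the error vectors of the robots in $\mathcal{N}_i$ in ascending index order and that $\|x\|_Q^2=x^\top Qx$. Only $\mathcal{N}_1$ and $\bar{\mathcal{N}}_1$ come from the caption; the other neighbor sets are invented so that the graph is consistent with it.

```python
import numpy as np

# Hypothetical neighbor sets for the M = 6 graph of Figure 2. Only
# N_1 = {1, 2, 5} (and hence bar N_1 = {1, 6}) comes from the caption.
NEIGHBORS = {
    1: {1, 2, 5},
    2: {2, 3},
    3: {3, 4},
    4: {4, 5},
    5: {5, 6},
    6: {6, 1},
}


def reverse_neighbors(neighbors):
    """bar N_j = {i : j in N_i}: robots that list j among their neighbors."""
    rev = {j: set() for j in neighbors}
    for i, n_i in neighbors.items():
        for j in n_i:
            rev[j].add(i)
    return rev


def local_stage_cost(i, errors, u_i, Q_i, R_i, neighbors):
    """r_i = ||e_{N_i}||^2_{Q_i} + ||u_i||^2_{R_i}.

    Assumption: e_{N_i} stacks errors[j] for j in N_i in ascending order,
    so Q_i must be square with size len(N_i) * dim(errors[j]).
    """
    e_Ni = np.concatenate([errors[j] for j in sorted(neighbors[i])])
    return float(e_Ni @ Q_i @ e_Ni + u_i @ R_i @ u_i)
```

With a 2D unit error for every robot, $Q_1=I_6$, $R_1=I_2$, and $u_1=(1,0)$, the cost is $6+1=7$. Because each robot needs only its neighbors' errors, the cost evaluation per robot stays constant as the team grows.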

Theorems & Definitions (18)

Remark 1
Remark 2
Remark 3
Remark 4
Definition 1: Barrier functions (wills2004barrier)
Remark 5
Theorem 1: Convergence
Theorem 2: Closed-loop stability
Theorem 3: Convergence of Algorithm alg:d-lpc-AC-o
Theorem 4: Closed-loop stability under Algorithm alg:d-lpc-AC-o
...and 8 more
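Definition 1 and Figure 4B concern the log barrier for a box constraint $-b<z_i<b$. The sketch below implements $\mathcal{B}^o_{z,i}$ and its recentered form $\mathcal{B}^c_{z,i}=\mathcal{B}^o_{z,i}+2\log b$ exactly as written in the Figure 4 caption. For the relaxation it swaps in the standard relaxed log barrier (quadratic extension of $-\log x$ below a threshold $\delta$); this is an assumption, and the paper's $\delta_i(z_i,\bar{\sigma}_i)$ in (Eqn:relaxed_B) may differ. Function names are hypothetical.

```python
import math


def barrier_original(z, b):
    """B^o(z) = -log(b - z) - log(b + z), finite only for -b < z < b."""
    return -math.log(b - z) - math.log(b + z)


def barrier_recentered(z, b):
    """B^c(z) = B^o(z) + 2 log b; since b^2 - z^2 <= b^2, B^c >= 0 and B^c(0) = 0."""
    return barrier_original(z, b) + 2.0 * math.log(b)


def relaxed_log(x, delta):
    """Standard relaxed barrier for -log(x) (assumed, not taken from the paper).

    Exact for x > delta; for x <= delta a quadratic that matches -log(x)
    in value and slope at x = delta, so it is C^1 and finite for all x.
    """
    if x > delta:
        return -math.log(x)
    return 0.5 * (((x - 2.0 * delta) / delta) ** 2 - 1.0) - math.log(delta)


def barrier_relaxed_recentered(z, b, delta):
    # Relax each log term separately, then recenter so the value at z = 0 is 0
    # whenever b > delta; the result stays finite even outside (-b, b).
    return relaxed_log(b - z, delta) + relaxed_log(b + z, delta) + 2.0 * math.log(b)
```

Inside the region where both $b-z$ and $b+z$ exceed $\delta$, the relaxed function coincides with $\mathcal{B}^c_{z,i}$; outside it grows quadratically instead of becoming infinite, which keeps gradients defined if a learned policy temporarily violates the constraint.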
