Distributed Online Rollout for Multivehicle Routing in Unmapped Environments

Jamison W. Weber; Dhanush R. Giriyan; Devendra R. Parkar; Dimitri P. Bertsekas; Andréa W. Richa

TL;DR

The paper addresses unmapped multivehicle routing with local sensing and no central controller (UMVRP-L) by introducing Decentralized Multiagent Rollout (DMAR), a fully distributed online algorithm that forms constant-size agent clusters and solves local rollout problems within them. DMAR proceeds in three phases (Self-Organizing Agent Clusters, Local Map Aggregation, and Team-Restricted Multiagent Rollout) to achieve scalable coordination without global topology knowledge, and the authors prove that it is probabilistically complete, with an expected $O(N^2)$ number of rounds. Empirically, there exists a critical sensing radius around $\log_2^*(N)$ beyond which rollout begins to outperform a greedy base policy; within the effective radius range $[2\log_2^*(N),3\log_2^*(N)]$, DMAR yields about a factor-of-two improvement in movement cost over the base policy and substantial compute savings versus centralized rollout. The approach is validated through extensive discrete simulations and physical robot experiments on the Robotarium, demonstrating robustness to sensor noise and practicality for real-world unmapped environments.
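To make the radius bounds in the summary above concrete, the following is a minimal sketch of the iterated logarithm $\log_2^*$ and the resulting effective range $[2\log_2^*(N),3\log_2^*(N)]$. It assumes that $N$ is the number of nodes in the network; the function names `log_star` and `effective_radius_range` are illustrative and do not come from the paper.

```python
import math


def log_star(n: float) -> int:
    """Iterated base-2 logarithm: how many times log2 must be applied
    before the value drops to 1 or below."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


def effective_radius_range(num_nodes: int) -> tuple[int, int]:
    """Range [2*log*(N), 3*log*(N)] quoted in the summary.
    Assumption: N is the number of network nodes."""
    k = log_star(num_nodes)
    return 2 * k, 3 * k


if __name__ == "__main__":
    # Grid sizes matching the 10x10 ... 80x80 experiments (N = side * side).
    for side in (10, 20, 40, 80):
        n = side * side
        print(side, log_star(n), effective_radius_range(n))
```

Because $\log_2^*$ grows extremely slowly, the value is the same for every grid size in this loop, which is consistent with the abstract's remark that the critical radius is a small constant for any relevant network.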

Abstract

In this work we consider a generalization of the well-known multivehicle routing problem: given a network, a set of agents occupying a subset of its nodes, and a set of tasks, we seek a minimum cost sequence of movements subject to the constraint that each task is visited by some agent at least once. The classical version of this problem assumes a central computational server that observes the entire state of the system perfectly and directs individual agents according to a centralized control scheme. In contrast, we assume that there is no centralized server and that each agent is an individual processor with no a priori knowledge of the underlying network (including task and agent locations). Moreover, our agents possess strictly local communication and sensing capabilities (restricted to a fixed radius around their respective locations), aligning more closely with several real-world multiagent applications. These restrictions introduce many challenges that are overcome through local information sharing and direct coordination between agents. We present a fully distributed, online, and scalable reinforcement learning algorithm for this problem whereby agents self-organize into local clusters and independently apply a multiagent rollout scheme locally to each cluster. We demonstrate empirically via extensive simulations that there exists a critical sensing radius beyond which the distributed rollout algorithm begins to improve over a greedy base policy. This critical sensing radius grows proportionally to the $\log^*$ function of the size of the network, and is, therefore, a small constant for any relevant network. Our decentralized reinforcement learning algorithm achieves approximately a factor of two cost improvement over the base policy for a range of radii bounded from below and above by two and three times the critical sensing radius, respectively.
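The comparison between rollout and a greedy base policy can be illustrated with a deliberately simplified sketch. It is not DMAR: it assumes a single agent, a fully known obstacle-free grid, and unit move costs, and it applies standard one-step rollout on top of a nearest-task greedy policy. All names (`base_step`, `base_cost`, `rollout_cost`) are hypothetical.

```python
from typing import FrozenSet, Tuple

Pos = Tuple[int, int]
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def dist(a: Pos, b: Pos) -> int:
    """Manhattan distance on an obstacle-free grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def base_step(pos: Pos, tasks: FrozenSet[Pos]) -> Pos:
    """Greedy base policy: move one step toward the nearest remaining task
    (ties broken by task coordinates)."""
    target = min(tasks, key=lambda t: (dist(pos, t), t))
    if target[0] != pos[0]:
        return (pos[0] + (1 if target[0] > pos[0] else -1), pos[1])
    return (pos[0], pos[1] + (1 if target[1] > pos[1] else -1))


def base_cost(pos: Pos, tasks: FrozenSet[Pos]) -> int:
    """Number of moves the greedy base policy needs to visit every task."""
    tasks = tasks - {pos}
    cost = 0
    while tasks:
        pos = base_step(pos, tasks)
        tasks = tasks - {pos}
        cost += 1
    return cost


def rollout_step(pos: Pos, tasks: FrozenSet[Pos], size: int) -> Pos:
    """One-step lookahead: try each legal move, then simulate the base policy."""
    candidates = [
        (pos[0] + dx, pos[1] + dy)
        for dx, dy in MOVES
        if 0 <= pos[0] + dx < size and 0 <= pos[1] + dy < size
    ]
    return min(candidates, key=lambda p: (1 + base_cost(p, tasks), p))


def rollout_cost(pos: Pos, tasks: FrozenSet[Pos], size: int) -> int:
    """Number of moves the rollout policy needs to visit every task."""
    tasks = tasks - {pos}
    cost = 0
    while tasks:
        pos = rollout_step(pos, tasks, size)
        tasks = tasks - {pos}
        cost += 1
    return cost


if __name__ == "__main__":
    start = (3, 0)
    tasks = frozenset({(0, 0), (5, 0), (10, 0)})
    print("greedy :", base_cost(start, tasks))
    print("rollout:", rollout_cost(start, tasks, size=11))
```

Because the greedy policy here depends only on the current state, the base policy's own move is always among the candidates, so the rollout cost can never exceed the greedy cost in this simplified setting. The paper's algorithm additionally handles multiple agents, partial maps, and cluster-restricted computation, none of which is modeled here.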

Paper Structure (19 sections, 6 theorems, 1 equation, 7 figures, 2 algorithms)


Introduction
Our Contributions
MVRP and Other Related Work
Model and preliminaries
Rollout and Multiagent Rollout
Decentralized Multiagent Rollout
Self-Organizing Agent Clusters (Algorithm \ref{alg:overview}, Lines \ref{alg:soac-start}-\ref{alg:soac-end})
Local Map Aggregation (Algorithm \ref{alg:overview}, Lines \ref{alg:lma-start}-\ref{alg:lma-end})
Team-Restricted Multiagent Rollout and Execute Movement (Algorithm \ref{alg:overview}, Lines \ref{alg:TMAR-start}-\ref{alg:EM-end})
Experimental results
Physics-based Simulations and Robotics Experiments
Discussion
Alternative Views
DMAR variant with enforced cost-improvement property
Comprehensive simulation results
...and 4 more sections

Key Result

Lemma 3.1

Let $K\subseteq S$ be an arbitrary cluster of agents in $G$ obtained at the end of a round of SOAC. Then (i) $K$ forms an agent tree $\mathcal{T}_K$ rooted at a leader $\ell$ with height at most $\mathcal{L}(\psi)=\mathcal{O}(\psi)$ and $|\mathcal{T}_K|= c^{\mathcal{O}(\mathcal{L}(\psi))}$ agents; (

Figures (7)

Figure 1: SOAC Execution. Read left to right, top to bottom: Cars represent agents. Gray crosses are tasks. Agent colors indicate cluster membership; gray agents are in no cluster. IDs are indicated below each agent. Flags indicate which agents see tasks. Assume $k=2$, $\psi=4$. (top left) No agent is in a cluster; 1,2,3,5,7 see tasks. (top right) 1,2,5 see agents with larger IDs who see tasks, so they disqualify themselves. 3,7 become leaders of new clusters. (middle left) Iteration 1 begins. 2,4,5 join clusters; colored arrows indicate tree pointers. 8 sees agents in different clusters, declares itself the new leader, and initiates a join with messages. (middle right) 3 and 7 join 8's cluster. Join messages are propagated to 2,4,5. (bottom left) 2,4,5 join the green cluster. (bottom right) Iteration 2 begins. 1,6 join the green cluster. 9 is too distant, so it joins no cluster.
Figure 2: (Rows 1-3, left) Left vertical axes show the average cost of greedy-exploration base policy vs average cost of DMAR. Right vertical axes show average running time in seconds for DMAR and base policy. Critical radii are marked as $k^*$; shaded orange boxes show effective ranges. (Rows 1-3, right) Show average number of clusters from base policy vs those from DMAR. 95% confidence intervals are shown by shaded regions around respective means. (Bottom) Sampled critical radii function $k^*(N)$.
Figure 3: (Left) Physics-based simulation on 2$\times$2 environments. Left vertical axis shows average solution cost, right axis shows average number of clusters. 95% confidence intervals are given by shaded regions. (Right) DMAR execution in Robotarium. Small dots are tasks, boxes are obstacles. Star, diamond, hexagon indicate cluster membership.
Figure 4: Results for a $k$-hop-reduced view definition on $40\times40, 60\times60$ grids. (a,c) Left vertical axis shows average cost of greedy-exploration base policy (red) vs average cost of DMAR (blue). Right vertical axis shows average running time in seconds for DMAR and base policy. Critical radius is marked as $k^*$; shaded orange boxes show effective ranges. (b,d) Show average number of clusters from base policy (red) vs those from DMAR (blue). 95% confidence intervals shown by shaded regions around respective means.
Figure 5: Experimental results. (a-h, left vertical axis) Discrete-space simulation plots show the average cost of the greedy-exploration base policy (red curves) vs average cost of DMAR (blue curves) on $10\times10$, $20\times20$, $\ldots$, $80\times80$ grids. (a-h, right vertical axis) Curves correspond to average running time in seconds for both DMAR and the base policy. Critical radii are marked as $k^*$; effective ranges are represented by the shaded orange boxes. 95% confidence intervals are given by shaded regions around their respective means.
...and 2 more figures
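The first two panels of Figure 1 describe an initial leader selection: agents that see a task are candidates, and a candidate disqualifies itself if it sees another candidate with a larger ID. The sketch below illustrates only that step, inferred from the caption alone. It assumes that "sees" means within Manhattan distance $k$ on a grid, and the positions are made up for illustration; the join and merge iterations of SOAC are not modeled.

```python
from typing import Dict, Set, Tuple

Pos = Tuple[int, int]


def manhattan(a: Pos, b: Pos) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def initial_leaders(agents: Dict[int, Pos], tasks: Set[Pos], k: int) -> Set[int]:
    """Return IDs of agents that become leaders in the first step.

    Assumption: an agent 'sees' anything within Manhattan distance k.
    """

    def sees(a: Pos, b: Pos) -> bool:
        return manhattan(a, b) <= k

    # Candidates are the agents that currently see at least one task.
    candidates = {i for i, p in agents.items() if any(sees(p, t) for t in tasks)}

    # A candidate stays a leader only if no visible candidate has a larger ID.
    return {
        i
        for i in candidates
        if not any(j > i and sees(agents[i], agents[j]) for j in candidates)
    }


if __name__ == "__main__":
    # Hypothetical layout, not the one drawn in Figure 1.
    agents = {1: (0, 0), 2: (1, 0), 3: (2, 0), 7: (8, 8), 9: (5, 5)}
    tasks = {(1, 1), (8, 7)}
    print(initial_leaders(agents, tasks, k=2))
```

In this made-up layout, agents 1 and 2 defer to a visible candidate with a larger ID, and agent 9 sees no task, which mirrors the disqualification rule stated in the caption.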

Theorems & Definitions (6)

Lemma 3.1
Lemma 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Theorem B.1
