DeepFleet: Multi-Agent Foundation Models for Mobile Robots
Ameya Agaskar, Sriram Siva, William Pickering, Kyle O'Brien, Charles Kekeh, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, Jiaming Di, Isaac Iyengar, Ramya Dharoor, Dino Kirouani, Jimmy Erskine, Tamir Hegazy, Scott Niekum, Usman A. Khan, Federico Pecora, Joseph W. Durham
TL;DR
DeepFleet introduces four multi-agent foundation-model architectures (RC, RF, IF, and GF), trained on production warehouse data to forecast robot movement at fleet scale. The models embody different inductive biases, from a local, ego-centric robot view (RC) and a robot-floor view (RF) to whole-floor image (IF) and graph-based (GF) representations; RC and GF deliver the strongest performance and efficiency. In the evaluations, action-prediction approaches outperformed floor-state prediction, and image-based representations underperformed because their inductive bias is mismatched to fleet dynamics. Scaling experiments indicate that increasing model size and dataset size improves performance: GF yields a tractable scaling curve that informs the optimal compute/data mix, while RC shows promising gains with more data, pending further study. The practical value lies in enabling congestion forecasting, adaptive routing, and proactive rescheduling across thousands of warehouse robots, supporting more scalable and robust fleet management.
Abstract
We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large-scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi-agent foundation models: the robot-centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot-floor (RF) model uses a transformer with cross-attention between robots and the warehouse floor; the image-floor (IF) model applies convolutional encoding to a multi-channel image representation of the full fleet; and the graph-floor (GF) model combines temporal attention with graph neural networks for spatial relationships. In this paper, we describe these models and evaluate how these design choices affect performance on prediction tasks. We find that the robot-centric and graph-floor models, which both use asynchronous robot state updates and incorporate the localized structure of robot interactions, show the most promise. We also present experiments showing that these two models can make effective use of larger warehouse operation datasets as the models are scaled up.
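To make the contrast between a localized, robot-centric input and a whole-floor image input concrete, the following is a minimal sketch on a toy grid. It is not the DeepFleet implementation: the `Robot` fields, the square neighborhood of a given `radius`, and the two-channel layout (occupancy and goal cells) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Robot:
    # Hypothetical toy state: grid position plus goal cell.
    robot_id: int
    x: int
    y: int
    goal_x: int
    goal_y: int


def neighborhood(robots, ego, radius):
    """Robot-centric view: offsets (dx, dy) of other robots that lie
    within a square window of half-width `radius` around `ego`."""
    view = []
    for r in robots:
        if r.robot_id == ego.robot_id:
            continue
        dx, dy = r.x - ego.x, r.y - ego.y
        if max(abs(dx), abs(dy)) <= radius:
            view.append((dx, dy))
    return sorted(view)


def floor_image(robots, width, height):
    """Whole-floor view: a multi-channel grid indexed [channel][y][x].
    Channel 0 marks occupied cells, channel 1 marks goal cells
    (an assumed channel layout, chosen for illustration)."""
    occupancy = [[0] * width for _ in range(height)]
    goals = [[0] * width for _ in range(height)]
    for r in robots:
        occupancy[r.y][r.x] = 1
        goals[r.goal_y][r.goal_x] = 1
    return [occupancy, goals]


fleet = [
    Robot(0, 1, 1, 4, 1),
    Robot(1, 2, 1, 0, 0),
    Robot(2, 7, 5, 7, 0),
]
ego_view = neighborhood(fleet, fleet[0], radius=2)
image = floor_image(fleet, width=8, height=6)
```

In this toy setup the robot-centric input grows with the number of nearby robots, whereas the image input grows with the floor area regardless of how many cells are occupied, which is one way to picture why the two representations carry different inductive biases.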
