Language-Conditioned Offline RL for Multi-Robot Navigation

Steven Morad, Ajay Shankar, Jan Blumenkamp, Amanda Prorok

TL;DR

This paper addresses natural language–driven navigation for multi-robot teams by conditioning low-latency control policies on embeddings from pretrained LLMs and training exclusively on offline real-world data. It introduces a two-stage approach: (1) collect a single-robot dataset and (2) generate a massive combinatorial multi-agent dataset virtually, enabling offline MARL without simulators. By reframing Q-learning with an offline Expected SARSA objective and evaluating multiple variants (Mean Q, Soft Q, and CQL), the authors find that safer, data-grounded objectives yield robust generalization to unseen commands and stable real-world deployment. Real-robot experiments with up to five agents show generalization to novel instructions, low control latency, and negligible collisions, highlighting the practical potential for language-conditioned, offline-trained multi-robot systems without finetuning.
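The two-stage data idea can be illustrated with a minimal sketch. It assumes single-robot transitions are stored as arrays of positions, discrete actions, and next positions, and that a joint multi-agent transition is formed by stacking independently sampled single-robot transitions. The array layout, the function names `sample_joint_transition` and `min_pairwise_distance`, and the synthetic data are all hypothetical, not taken from the paper.

```python
import numpy as np

def sample_joint_transition(positions, actions, next_positions, n_agents, rng):
    """Stack n_agents independently sampled single-robot transitions.

    Assumption: each single-robot transition can be reused as the
    transition of any one agent in a virtual multi-agent team.
    """
    idx = rng.integers(0, len(positions), size=n_agents)
    return positions[idx], actions[idx], next_positions[idx]

def min_pairwise_distance(points):
    """Smallest distance between any two agents (e.g. for a collision check)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    dist[np.diag_indices(len(points))] = np.inf  # ignore self-distances
    return dist.min()

# Synthetic stand-in for a recorded single-robot dataset (hypothetical values).
rng = np.random.default_rng(0)
num_transitions = 1000
positions = rng.uniform(-1.0, 1.0, size=(num_transitions, 2))
actions = rng.integers(0, 9, size=num_transitions)
next_positions = positions + rng.normal(scale=0.05, size=(num_transitions, 2))

n_agents = 3
joint_pos, joint_act, joint_next = sample_joint_transition(
    positions, actions, next_positions, n_agents, rng
)

# Number of ordered joint transitions available without any new data collection.
num_joint = num_transitions ** n_agents
closest = min_pairwise_distance(joint_next)
```

With 1000 single-robot transitions and three agents, this construction already admits 10^9 ordered joint transitions, which is why the virtual dataset is sampled rather than enumerated.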

Abstract

We present a method for developing navigation policies for multi-robot teams that interpret and follow natural language instructions. We condition these policies on embeddings from pretrained Large Language Models (LLMs), and train them via offline reinforcement learning with as little as 20 minutes of randomly-collected data. Experiments on a team of five real robots show that these policies generalize well to unseen commands, indicating an understanding of the LLM latent space. Our method requires no simulators or environment models, and produces low-latency control policies that can be deployed directly to real robots without finetuning. We provide videos of our experiments at https://sites.google.com/view/llm-marl.
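The low-latency claim rests on computing the task embedding once per task, outside the control loop. The sketch below illustrates that split under simplifying assumptions: `embed_task` is a hypothetical stand-in for an LLM encoder (it returns a deterministic pseudo-embedding, not a real language embedding), and the policy is a plain linear Q-function over the concatenated observation and embedding rather than the architecture used in the paper.

```python
import hashlib
import numpy as np

EMBED_DIM = 16   # hypothetical embedding size
NUM_ACTIONS = 5  # hypothetical discrete action count

def embed_task(text):
    """Hypothetical stand-in for an LLM encoder (NOT a real language model)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    return np.random.default_rng(seed).normal(size=EMBED_DIM)

class TaskConditionedPolicy:
    """Linear Q-function over [observation, task embedding] (illustrative only)."""

    def __init__(self, obs_dim, rng):
        self.weights = rng.normal(scale=0.1, size=(obs_dim + EMBED_DIM, NUM_ACTIONS))
        self.task_embedding = None
        self.embed_calls = 0

    def set_task(self, text):
        # The expensive encoder runs here, once per task.
        self.task_embedding = embed_task(text)
        self.embed_calls += 1

    def act(self, obs):
        # The control loop only touches the small policy network.
        features = np.concatenate([obs, self.task_embedding])
        q_values = features @ self.weights
        return int(np.argmax(q_values))

rng = np.random.default_rng(0)
policy = TaskConditionedPolicy(obs_dim=4, rng=rng)
policy.set_task("navigate to the left edge")
chosen_actions = [policy.act(rng.normal(size=4)) for _ in range(100)]
```

After 100 control steps the encoder has still been called only once, so per-step latency depends on the policy alone.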

Paper Structure (38 sections, 14 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Our learned policies demonstrate emergent path deconfliction while following natural-language tasks. Agents, tasks, and goals are color-coordinated. Each agent (hollow disk) receives a natural language task (top), and must navigate to the goal (filled disk, rendered for visualisation only), leaving behind a colored trail. All five agents are in motion, but we highlight tasks and goals for three as they execute a three-way yield. (Left) The red, purple, and green agents block each other from reaching their respective goals. (Middle) Green moves towards its goal and purple yields to green by moving north, and red yields to purple by remaining stationary. (Right) After green completes its task, purple and red complete their tasks.
  • Figure 2: Our proposed multi-robot model architecture. Each agent receives a different natural language task and a local observation. We summarize each natural language task $g_i$ into a latent representation $z_i$, using an LLM $\phi$. The function $f$ is a graph neural network that encodes local observations $o_1, o_2, \dots$ and task embeddings $z_1, z_2, \dots$ into a task-dependent state representation $s_i | z$ for each agent $i$. We learn a local policy $\pi$ conditioned on the state-task representation. Functions $\pi, f$ are learned entirely from a fixed dataset using offline RL. Because we compute $z_i$ only once per task, the LLM is not part of the perception-action loop, allowing the policy to act quickly.
  • Figure 3: (Left) A comparison of LLMs [li_angle-optimized_2023; feng_language-agnostic_2022; muennighoff_mteb_2023; lee_open_2024; reimers_sentence-bert_2019; li_towards_2023; jiang_mistral_2023; wang_improving_2024] used for feature extraction. Our decoder only generalizes to certain LLM latent spaces (note the log scale y-axis). (Right) Control latency (observation to action) for different team sizes, tested on a 2020 MacBook Air CPU. Our policies map perception to action much faster than those with an LLM in the perception-action loop, which can often take seconds to produce each action.
  • Figure 4: (Left Two) We compare the best CQL and Soft Q variants to Max Q and Mean Q objectives. Soft Q and Mean Q perform best. (Rightmost) We experiment with how much data collection is necessary to train a sufficient policy, using the Mean Q objective.
  • Figure 5: Evaluations from real-world multi-agent navigation tests, where each agent is provided a new task every 30s. We plot their distance from the goal (averaged over agents) as they navigate. The blue line represents tasks the agent has seen before, while the orange line represents unseen tasks. Our results show that our agents are able to generalize to unseen, out-of-distribution tasks. We consider success as reaching a mean distance $< 25\,\mathrm{cm}$ (red line) from the goal regions. For three agents, the Mean Q policy solves 9/10 train tasks and 8/10 test tasks while the Soft Q policy solves 10/10 train tasks and 9/10 test tasks. For five agents, the Soft Q policy solves 19/20 train tasks and 18/20 test tasks. See the experiments appendix for other objectives.
  • ...and 5 more figures
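
The Max Q, Mean Q, and Soft Q objectives compared in Figure 4 can be read as Expected SARSA targets that differ only in the policy used to average next-state Q-values. The sketch below follows that reading under explicit assumptions: Max Q uses a greedy policy, Mean Q a uniform policy, and Soft Q a Boltzmann (softmax) policy with a hypothetical temperature. The exact definitions in the paper may differ, and CQL is not covered here.

```python
import numpy as np

def max_q_probs(next_q):
    """Greedy policy: all probability mass on the best action."""
    probs = np.zeros_like(next_q)
    probs[np.argmax(next_q)] = 1.0
    return probs

def mean_q_probs(next_q):
    """Uniform policy (assumed reading of 'Mean Q')."""
    return np.full_like(next_q, 1.0 / len(next_q))

def soft_q_probs(next_q, temperature=1.0):
    """Boltzmann policy (assumed reading of 'Soft Q'); temperature is hypothetical."""
    z = (next_q - next_q.max()) / temperature  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def expected_sarsa_target(reward, next_q, policy_probs, gamma=0.99, done=False):
    """Bootstrap target: r + gamma * E_{a' ~ pi}[Q(s', a')]."""
    expected_next = float(np.dot(policy_probs, next_q))
    return reward + gamma * (1.0 - float(done)) * expected_next

next_q = np.array([1.0, 2.0, 3.0])
targets = {
    "max_q": expected_sarsa_target(1.0, next_q, max_q_probs(next_q), gamma=0.5),
    "mean_q": expected_sarsa_target(1.0, next_q, mean_q_probs(next_q), gamma=0.5),
    "soft_q": expected_sarsa_target(1.0, next_q, soft_q_probs(next_q), gamma=0.5),
}
```

Under these assumptions the Mean Q target is the most conservative and the Max Q target the most optimistic, with Soft Q in between. That ordering is consistent with the summary's framing of safer, data-grounded objectives, though it is not a substitute for the paper's own definitions.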