Verco: Learning Coordinated Verbal Communication for Multi-agent Reinforcement Learning

Dapeng Li; Hang Dong; Lu Wang; Bo Qiao; Si Qin; Qingwei Lin; Dongmei Zhang; Qi Zhang; Zhiwei Xu; Bin Zhang; Guoliang Fan

TL;DR

Verco addresses the challenges of partial observability and limited interpretability in multi-agent reinforcement learning by embedding large language models into agents so that they generate human-readable verbal messages. Training follows a teacher-student flow: the communication module's LoRA adapters are first trained with Supervised Fine-Tuning, then PPO-based reinforcement learning aligns the action policy with environment feedback, with the communication and action components kept decoupled. In experiments in the Overcooked environment, Verco shows improved learning efficiency, higher returns, and more interpretable coordination than baselines, with concrete verbal messages that reveal collaboration strategies. The work offers a practical framework for human-understandable coordination in MARL and opens avenues for deploying communicative AI in real-world cooperative tasks.

Abstract

In recent years, multi-agent reinforcement learning algorithms have made significant advancements in diverse gaming environments, leading to increased interest in the broader application of such techniques. To address the prevalent challenge of partial observability, communication-based algorithms have improved cooperative performance by sharing numerical embeddings between agents. However, how collaborative mechanisms form is still poorly understood, which makes designing a human-understandable communication mechanism a valuable problem to address. In this paper, we propose a novel multi-agent reinforcement learning algorithm that embeds large language models into agents, endowing them with the ability to generate human-understandable verbal communication. The framework consists of a message module and an action module. The message module generates verbal messages and sends them to other agents, enhancing information sharing among agents. To further enhance the message module, we employ a teacher model to generate message labels from the global view and update the student model through Supervised Fine-Tuning (SFT). The action module receives messages from other agents and selects actions based on current local observations and received messages. Experiments conducted on the Overcooked game demonstrate that our method significantly enhances the learning efficiency and performance of existing methods, while also providing an interpretable tool for humans to understand the process of multi-agent cooperation.
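The two-module design described in the abstract can be sketched as a single communicate-then-act round. This is only an illustrative sketch: the `Agent` class, the `communication_round` function, and the rule-based stub policies are hypothetical stand-ins for the LLM-based message and action modules, and the observation strings and action names are invented for the example rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases: observations, messages, and actions are plain text here.
MessagePolicy = Callable[[str], str]            # local observation -> verbal message
ActionPolicy = Callable[[str, List[str]], str]  # (local observation, received messages) -> action


@dataclass
class Agent:
    name: str
    message_policy: MessagePolicy  # stand-in for the message module
    action_policy: ActionPolicy    # stand-in for the action module


def communication_round(
    agents: List[Agent], observations: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run one round: every agent sends a message, then every agent acts."""
    # Phase 1: each agent writes a message from its own local observation only.
    messages = {a.name: a.message_policy(observations[a.name]) for a in agents}
    # Phase 2: each agent acts on its local observation plus the other agents' messages.
    actions = {}
    for a in agents:
        inbox = [msg for sender, msg in messages.items() if sender != a.name]
        actions[a.name] = a.action_policy(observations[a.name], inbox)
    return messages, actions


def stub_message_policy(obs: str) -> str:
    # Assumption: a trivial template replaces the fine-tuned language model.
    return f"I see: {obs}"


def stub_action_policy(obs: str, inbox: List[str]) -> str:
    # Toy rule: if a teammate reports a tomato, fetch a plate instead of the tomato.
    if any("tomato" in msg for msg in inbox):
        return "fetch_plate"
    return "fetch_tomato"


agents = [
    Agent("alice", stub_message_policy, stub_action_policy),
    Agent("bob", stub_message_policy, stub_action_policy),
]
observations = {"alice": "tomato on counter", "bob": "empty plate nearby"}
messages, actions = communication_round(agents, observations)
print(messages)
print(actions)
```

Keeping the two policies as separate callables mirrors the decoupling the TL;DR mentions: the message policy could be trained with supervised labels while the action policy is trained from environment rewards, without either depending on the other's internals.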

Paper Structure (19 sections, 6 equations, 6 figures, 1 algorithm)

This paper contains 19 sections, 6 equations, 6 figures, 1 algorithm.

Introduction
Related Work
LLMs for Decision Making
Finetuning LLMs
MARL with Communication
Preliminaries
Problem Formulation
LLM and Finetuning
Method
Cooperation with verbal communication
Coordination Message Policy Pre-training
Action Policy Alignment
Experiments
Environment description
Baselines
...and 4 more sections

Figures (6)

Figure 1: Incorrect messages can easily lead to conflicts, while coordinated messages can promote efficient cooperation among agents.
Figure 2: Verco framework. We first fine-tune the LoRA weights of the communication module with the global message labels. We then load the LoRA weights so that the agent can generate verbal messages directly from its local observation. Meanwhile, the action policy takes the current local observation and the text messages from other agents as input and outputs the decision. The action policy's weights are fine-tuned with PPO using the rewards returned by the environment.
Figure 3: Message module SFT stage. We employ a large model (GPT-4) as the teacher to generate message samples from global observations, and distill them into a smaller language model (LLaMA-7B) that serves as the communication model $\pi_\eta$.
Figure 4: Experimental environments: two different maps in Overcooked (Map A and Map B) and the production process of tomato salad.
Figure 5: Results for different maps in Overcooked environment. The first column shows the return curve in each episode (higher is better), the second column shows the length of each episode (lower is better), and the third column shows the curve of policy entropy, demonstrating the uncertainty of the policy in action selection.
...and 1 more figure
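The Figure 5 caption reads policy entropy as a measure of uncertainty in action selection. As a minimal sketch of that quantity, assuming the standard Shannon entropy of a discrete action distribution measured in nats (the section does not state the paper's exact definition or units), the hypothetical helper `policy_entropy` below compares a uniform policy with a nearly deterministic one.

```python
import math
from typing import Sequence


def policy_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability actions contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])  # maximally uncertain over 4 actions
peaked = policy_entropy([0.97, 0.01, 0.01, 0.01])   # nearly deterministic policy
print(uniform, peaked)
```

A falling entropy curve during training therefore indicates that the policy is concentrating probability on fewer actions.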
