SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems

Dong Zhang; Zhaowei Li; Pengyu Wang; Xin Zhang; Yaqian Zhou; Xipeng Qiu

TL;DR

SpeechAgents addresses a gap in simulating human communication: existing LLM-based multi-agent systems exchange text, whereas human communication is multi-modal. The approach combines SpeechGPT as the control center of each agent, speech as the medium for messages exchanged among agents, and Multi-Agent Tuning to strengthen multi-agent capabilities while preserving general abilities. A Human-Communication Simulation Benchmark is introduced to generate diverse scenes, roles, and communication scripts, which are extended to speech via text-to-speech conversion. Experimental results show that SpeechAgents generates consistent, rhythmical, and expressive dialogues with up to 25 agents, outperforming text-based baselines on communication tasks and enabling applications such as drama creation and audio novel generation. The authors state that code and models will be open-sourced to advance research in realistic multi-modal human communication simulation.

Abstract

Human communication is a complex and diverse process that not only involves multiple factors such as language, commonsense, and cultural backgrounds but also requires the participation of multimodal information, such as speech. Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society. Can we leverage LLM-based multi-agent systems to simulate human communication? However, current LLM-based multi-agent systems mainly rely on text as the primary medium. In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. SpeechAgents utilizes a multi-modal LLM as the control center for each individual agent and employs multi-modal signals as the medium for messages exchanged among agents. Additionally, we propose Multi-Agent Tuning to enhance the multi-agent capabilities of the LLM without compromising its general abilities. To strengthen and evaluate the effectiveness of human communication simulation, we build the Human-Communication Simulation Benchmark. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and that it scales well even with up to 25 agents, which makes it applicable to tasks such as drama creation and audio novel generation. Code and models will be open-sourced at https://github.com/0nutation/SpeechAgents

Paper Structure (27 sections, 1 equation, 4 figures, 1 table)

Introduction
Related Work
Human-Communication Simulation Benchmark
SpeechAgents
Multi-modal Multi-Agent System
Multi-Agent Tuning
Experiments
Experimental Setups
Baselines
Evaluation
Main Results
Analysis
Ablation Study
Scalability of Agent Numbers
Case Study
...and 12 more sections

Figures (4)

Figure 1: (a) An LLM-based multi-agent system is built on a text-based LLM and relies on text as the medium for information exchange. (b) A multi-modal LLM-based multi-agent system is built on a multi-modal LLM and relies on multi-modal signals as the medium for information exchange.
Figure 2: An overview of the Human-Communication Simulation Benchmark construction process. We initiate the process by creating diverse scenes that simulate human communication. Subsequently, a role pool containing various roles is generated for each scene. Roles are then selected from the pool, and communication scripts are generated, depending on the specific scene and roles involved. Ultimately, multi-modal human communication scripts are crafted through text-to-speech conversion.
Figure 3: Illustration of the training and inference process of an individual agent in SpeechAgents. The solid arrows represent the data flow during inference. During an agent's turn, it receives inputs that include the scene, background, role, profile, and the message stream from the speech message stream bank. The agent's output consists of its inner thoughts, the generated speech response, and the corresponding style. The response with its style is then written to the speech message stream bank. The dashed arrows represent the data flow during training. Agent trajectory instructions, parsed from scripts in the Human-Communication Simulation Benchmark, are shown in the diagram as the concatenation of agent input and output and are used for multi-agent tuning of the multi-modal LLM.
Figure 4: Consistency and Quality scores of SpeechAgents in human-communication scenarios with different numbers of roles.
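
The inference data flow described in the Figure 3 caption can be sketched as a small turn loop. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (Message, AgentInput, AgentOutput, MessageBank, take_turn, run_dialogue, stub_controller) is hypothetical, speech is stood in for by plain strings, the multi-modal LLM is replaced by a stub function, and the round-robin turn order is an assumption that the caption does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    speaker: str
    speech: str  # stand-in for a speech signal; a string here for illustration
    style: str


@dataclass
class AgentInput:
    # The five inputs named in the Figure 3 caption.
    scene: str
    background: str
    role: str
    profile: str
    history: List[Message]


@dataclass
class AgentOutput:
    # The three outputs named in the Figure 3 caption.
    thoughts: str
    speech: str
    style: str


@dataclass
class MessageBank:
    """Hypothetical shared store standing in for the speech message stream bank."""
    messages: List[Message] = field(default_factory=list)

    def read(self) -> List[Message]:
        return list(self.messages)

    def write(self, message: Message) -> None:
        self.messages.append(message)


# Any callable mapping AgentInput to AgentOutput can act as the control center.
Controller = Callable[[AgentInput], AgentOutput]


def stub_controller(agent_input: AgentInput) -> AgentOutput:
    """Placeholder for the multi-modal LLM; returns canned content."""
    turn = len(agent_input.history) + 1
    return AgentOutput(
        thoughts=f"{agent_input.role} considers turn {turn}",
        speech=f"<speech by {agent_input.role}, turn {turn}>",
        style="neutral",
    )


def take_turn(role: str, profile: str, scene: str, background: str,
              bank: MessageBank, controller: Controller) -> AgentOutput:
    agent_input = AgentInput(scene, background, role, profile, bank.read())
    output = controller(agent_input)
    # Only the response and its style are written back; thoughts stay private.
    bank.write(Message(role, output.speech, output.style))
    return output


def run_dialogue(scene: str, background: str, profiles: Dict[str, str],
                 num_turns: int, controller: Controller) -> MessageBank:
    bank = MessageBank()
    roles = list(profiles)
    for turn in range(num_turns):
        role = roles[turn % len(roles)]  # assumed round-robin order
        take_turn(role, profiles[role], scene, background, bank, controller)
    return bank


if __name__ == "__main__":
    bank = run_dialogue(
        scene="family dinner",
        background="discussing weekend plans",
        profiles={"Parent": "calm and practical", "Child": "excited"},
        num_turns=4,
        controller=stub_controller,
    )
    for message in bank.messages:
        print(message.speaker, message.style, message.speech)
```

Keeping the controller as a plain callable means the stub could be swapped for a real model without touching the loop, and the bank only ever stores what other agents are meant to hear.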
