Multi-Agent Large Language Models for Conversational Task-Solving

Jonas Becker

TL;DR

The paper addresses the limitations of single LLMs by proposing Multi-Agent LLMs for conversational task-solving. It introduces MALLM, a modular framework and taxonomy that decouples agents, discussion paradigms, and decision-making to study how multiple expert personas interact in dialogue. Empirical results show that multi-agent LLMs enhance complex reasoning and ethical alignment, but can underperform on basic tasks due to problem drift and alignment hazards; centralized, information-sharing paradigms can mitigate some safety concerns. The work highlights the tradeoffs between discussion length, task complexity, and agent composition, and offers guidelines and directions for safer, more efficient multi-agent AI systems. Overall, it provides a rigorous foundation for evaluating how multi-agent interactions influence performance across generative and QA tasks and points to future research on safety, fairness, and scalable deployment.

Abstract

In an era where single large language models have dominated the landscape of artificial intelligence for years, multi-agent systems arise as new protagonists in conversational task-solving. While previous studies have showcased their potential in reasoning tasks and creative endeavors, an analysis of their limitations concerning the conversational paradigms and the impact of individual agents is missing. It remains unclear how multi-agent discussions perform across tasks of varying complexity and how the structure of these conversations influences the process. To fill that gap, this work systematically evaluates multi-agent systems across various discussion paradigms, assessing their strengths and weaknesses in both generative tasks and question-answering tasks. Alongside the experiments, I propose a taxonomy of 20 multi-agent research studies from 2022 to 2024, followed by the introduction of a framework for deploying multi-agent LLMs in conversational task-solving. I demonstrate that while multi-agent systems excel in complex reasoning tasks, outperforming a single model by leveraging expert personas, they fail on basic tasks. Concretely, I identify three challenges that arise: 1) While longer discussions enhance reasoning, agents fail to maintain conformity to strict task requirements, which leads to problem drift, making shorter conversations more effective for basic tasks. 2) Prolonged discussions risk alignment collapse, raising new safety concerns for these systems. 3) I showcase discussion monopolization through long generations, posing the problem of fairness in decision-making for tasks like summarization. This work uncovers both the potential and challenges that arise with multi-agent interaction and varying conversational paradigms, providing insights into how future research could improve the efficiency, performance, and safety of multi-agent LLMs.

Paper Structure

This paper contains 56 sections, 1 equation, 29 figures, 18 tables.

Figures (29)

  • Figure 1: A high-level view of MALLM (Multi-Agent Large Language Models), compared with Chain-of-Thought [WeiWSB23] for a single model. MALLM comprises three main components: automated persona assignment, collaborative discussion, and decision-making. A more technical overview is given in Figure 3.
  • Figure 2: Taxonomy of Multi-Agent LLMs for conversational problem-solving. Underlined nodes indicate what is relevant to my experiments. For an explanation of all the components, please refer to the taxonomy section.
  • Figure 3: Functionality of MALLM applied to my experiments. First, MALLM automatically determines three personas. Each persona then contributes to multi-agent discussion under one of four paradigms (structural communication schemes). After each contribution, a decision-making mechanism checks if a consensus is reached.
  • Figure 4: Accuracy on (a) the Simple Ethical Questions dataset and (b) the StrategyQA dataset. Error bars are the standard deviation between five runs.
  • Figure 5: Number of exchanged messages before agents reach a consensus on (a) XSum and (b) Simple Ethical Questions. All results of the five experiment runs are combined for this figure.
  • ...and 24 more figures
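
The flow described in the caption of Figure 3 (assign personas, let each persona contribute in turn, check for consensus after each contribution) can be illustrated with a minimal sketch. This is not MALLM's actual API: the `Agent` class, the `majority_consensus` and `discuss` functions, the round-robin turn order, and the unanimity threshold are all assumptions made for illustration, and the agents are plain callables standing in for LLM calls.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Agent:
    """Hypothetical stand-in for an LLM-backed persona (not MALLM's API)."""
    persona: str
    respond: Callable[[str, List[str]], str]  # (task, history) -> proposed answer


def majority_consensus(proposals: List[str], threshold: float = 1.0) -> Optional[str]:
    """Return the most common proposal if its share reaches the threshold.

    threshold=1.0 means unanimity; this default is an assumption.
    """
    if not proposals:
        return None
    answer, count = Counter(proposals).most_common(1)[0]
    return answer if count / len(proposals) >= threshold else None


def discuss(task: str, agents: List[Agent], max_turns: int = 3) -> Tuple[Optional[str], int]:
    """Round-robin discussion; returns (decision, number of messages exchanged)."""
    history: List[str] = []
    latest: Dict[str, str] = {}  # most recent proposal per persona
    for _ in range(max_turns):
        for agent in agents:
            answer = agent.respond(task, history)
            history.append(f"{agent.persona}: {answer}")
            latest[agent.persona] = answer
            # Check for consensus after each contribution, once everyone has spoken.
            if len(latest) == len(agents):
                decision = majority_consensus(list(latest.values()))
                if decision is not None:
                    return decision, len(history)
    return None, len(history)  # no consensus within the turn budget


if __name__ == "__main__":
    # Stub personas: two fixed opinions and one that changes its mind
    # after seeing at least three messages.
    always_yes = lambda task, history: "yes"
    follower = lambda task, history: "no" if len(history) < 3 else "yes"
    agents = [
        Agent("ethicist", always_yes),
        Agent("lawyer", always_yes),
        Agent("skeptic", follower),
    ]
    print(discuss("Is it acceptable to return a lost wallet?", agents))
```

Capping `max_turns` reflects the tradeoff the abstract raises: longer discussions may help reasoning, while shorter ones limit problem drift on basic tasks. The message count returned alongside the decision corresponds to the quantity plotted in Figure 5.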