Problem-Solving in Language Model Networks

Ciaran Regan, Alexandre Gournail, Mizuki Oka

TL;DR

This study extends multi-agent debate to graph-based network topologies to assess how network structure, self-reflection, and bias affect question-answering (QA) performance in language-model agents. It compares scale-free, random, fully connected, and fully disconnected networks over four rounds of debate on 100 MMLU math questions with GPT-3.5-Turbo. Random networks match fully connected performance while using far fewer tokens, and bias at hub nodes can strongly shift outcomes, with correctly biased hubs boosting performance. Strong consensus among agents tends to coincide with correct answers, while divided responses tend to indicate incorrect ones, so the level of agreement can serve as a proxy for confidence. These findings inform scalable design choices for collaborative AI systems: cost-effective random topologies, scale-free networks with knowledgeable agents at the hubs, and consensus metrics to gauge confidence in collective decisions.

Abstract

To improve the reasoning and question-answering capabilities of Large Language Models (LLMs), several multi-agent approaches have been introduced. While these methods enhance performance, the application of collective intelligence-based approaches to complex network structures and the dynamics of agent interactions remain underexplored. This work extends the concept of multi-agent debate to more general network topologies, measuring question-answering accuracy, influence, consensus, and the effects of bias on the collective. The results show that random networks perform similarly to fully connected networks despite using significantly fewer tokens. Furthermore, a strong consensus among agents correlates with correct answers, whereas divided responses typically indicate incorrect answers. Analysing the influence of the agents reveals a balance between self-reflection and interconnectedness: self-reflection aids when local interactions are incorrect, and local interactions aid when the agent itself is incorrect. Additionally, bias plays a strong role in system performance, with correctly biased hub nodes boosting it. These insights suggest that using random networks, or scale-free networks with knowledgeable agents placed in central positions, can enhance the overall question-answering performance of multi-agent systems.
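
To make the token-cost comparison in the abstract concrete, the following is a minimal Python sketch of the four topologies, not the authors' implementation. It builds each network as an adjacency dictionary and counts how many neighbour responses are read per debate round, which is used here only as a rough stand-in for token usage. The agent count, the edge probability `p`, the attachment parameter `m`, and all function names are illustrative assumptions; none of them come from the paper.

```python
import random
from itertools import combinations


def fully_connected(n):
    """Every agent is linked to every other agent."""
    return {i: {j for j in range(n) if j != i} for i in range(n)}


def fully_disconnected(n):
    """No communication channels; agents can only self-reflect."""
    return {i: set() for i in range(n)}


def random_network(n, p, rng):
    """Erdos-Renyi style graph: each possible edge exists with probability p."""
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def scale_free_network(n, m, rng):
    """Barabasi-Albert style preferential attachment (assumed construction).

    Starts from a clique of m + 1 agents; each new agent links to m
    distinct existing agents chosen with probability proportional to degree.
    """
    seed = fully_connected(m + 1)
    adj = {i: set(seed.get(i, ())) for i in range(n)}
    # Each node appears in the pool once per unit of degree.
    pool = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for target in chosen:
            adj[new].add(target)
            adj[target].add(new)
            pool.extend([new, target])
    return adj


def messages_per_round(adj):
    """Neighbour responses read per round (a crude proxy for token cost)."""
    return sum(len(neighbours) for neighbours in adj.values())


if __name__ == "__main__":
    rng = random.Random(0)
    n = 25  # hypothetical number of agents
    networks = {
        "fully connected": fully_connected(n),
        "fully disconnected": fully_disconnected(n),
        "random": random_network(n, 0.1, rng),
        "scale-free": scale_free_network(n, 2, rng),
    }
    for name, adj in networks.items():
        print(name, messages_per_round(adj))
```

Under these assumptions the fully connected network reads n(n-1) neighbour responses per round, while the sparse topologies read a number proportional to their edge count, which is the intuition behind the reported token savings.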

Paper Structure (11 sections, 1 equation, 10 figures, 3 tables)

Figures (10)

  • Figure 1: An overview of multi-agent debate on networks. Each node represents an agent and each edge represents a communication channel between agents, with self-loops indicating agent self-reflection. In the first round, each agent answers the question individually, with all the agents getting the answer incorrect. In the second round, agent 2 gets the answer correct through self-reflection. This correct answer then spreads through the network in subsequent rounds of debate. After the last round of debate, "C" is taken as the final answer of the system as this is the most common answer.
  • Figure 2: The two-part prompt used for question answering. Subfigure (a) presents the initial prompt for agents to solve the problem independently. Subfigure (b) introduces the second stage, where agents are asked to re-evaluate their response after considering peer feedback and their previous response.
  • Figure 3: The prompt used to generate reasoning for biased agents, where $\texttt{biased\_answer}$ is either the correct answer for correctly biased agents or an incorrect answer for incorrectly biased agents.
  • Figure 4: The specific scale-free and random networks used in the experiments.
  • Figure 5: Accuracy per round of debate for different types of networks.
  • ...and 5 more figures
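
The debate loop described in the Figure 1 caption can be sketched in Python as below. This is an illustration under stated assumptions, not the paper's code: `stub_respond` is a hypothetical placeholder for the language-model call and simply adopts the most common answer among the agent's own previous answer (the self-loop) and its neighbours' answers. The function names and the consensus measure (the share of agents giving the most common final answer) are likewise assumptions made for this example.

```python
from collections import Counter


def majority(answers):
    """Most common answer; ties go to the answer encountered first."""
    return Counter(answers).most_common(1)[0][0]


def stub_respond(own_previous, neighbour_answers):
    """Hypothetical stand-in for an LLM agent re-evaluating its answer."""
    return majority([own_previous] + neighbour_answers)


def debate(adj, initial_answers, rounds, respond):
    """Run synchronous debate rounds and return the answers after each round.

    adj maps each agent to the set of its neighbours; round 1 is the
    agents' independent answers, given as initial_answers.
    """
    history = [dict(initial_answers)]
    for _ in range(rounds - 1):
        previous = history[-1]
        history.append({
            agent: respond(previous[agent],
                           [previous[n] for n in sorted(adj[agent])])
            for agent in adj
        })
    return history


def final_answer(history):
    """System answer: the most common answer after the last round."""
    return majority(list(history[-1].values()))


def consensus(history):
    """Fraction of agents that give the most common final answer."""
    answers = list(history[-1].values())
    return Counter(answers).most_common(1)[0][1] / len(answers)


if __name__ == "__main__":
    # Three fully connected agents; one starts with a different answer.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    history = debate(adj, {0: "A", 1: "C", 2: "C"}, rounds=4,
                     respond=stub_respond)
    print(final_answer(history), consensus(history))
```

Replacing `stub_respond` with a real model call would keep the same structure: each agent sees its own previous answer and its neighbours' answers, and the final system answer is the most common answer after the last round.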