X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, Siheng Chen

TL;DR

The paper investigates heterogeneous LLM-driven multi-agent systems (X-MAS) to overcome the limitations of single-model MAS. It introduces X-MAS-Bench, a comprehensive benchmark that evaluates 27 LLMs across 5 MAS-related functions and 5 domains, culminating in 1.7 million evaluations. The study shows that no single LLM excels across all scenarios and that heterogeneous configurations can markedly boost performance, with gains up to 8.4% in chatbot-only settings and up to 47% in mixed chatbot-reasoner tasks. Building on these insights, X-MAS-Design demonstrates that transitioning from homogeneous to heterogeneous LLM-driven MAS yields consistent improvements across multiple MAS methods and domains without structural redesign, highlighting the value of model diversity for scalable collaborative AI.

Abstract

LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
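
The idea of picking the best model for each domain-function combination can be illustrated as a lookup over benchmark scores. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the model names, the domain and function labels, the scores, and the `select_best_models` helper are all hypothetical placeholders.

```python
# Hypothetical benchmark records: (model, domain, function, accuracy).
# All names and numbers are illustrative placeholders, not results from the paper.
SCORES = [
    ("model-a", "math", "qa", 0.71),
    ("model-b", "math", "qa", 0.78),
    ("model-a", "math", "revise", 0.80),
    ("model-b", "math", "revise", 0.74),
    ("model-a", "coding", "qa", 0.66),
    ("model-b", "coding", "qa", 0.61),
]


def select_best_models(records):
    """Return {(domain, function): model} keeping the top-scoring model per pair."""
    best = {}  # (domain, function) -> (model, score)
    for model, domain, function, score in records:
        key = (domain, function)
        # Keep the first record seen, then replace only on a strictly higher score.
        if key not in best or score > best[key][1]:
            best[key] = (model, score)
    return {key: model for key, (model, _) in best.items()}


assignment = select_best_models(SCORES)
for (domain, function), model in sorted(assignment.items()):
    print(f"{domain:>6} / {function:<6} -> {model}")
```

With these placeholder scores, different domain-function pairs end up assigned to different models. That is the sense in which a homogeneous MAS can become heterogeneous without changing its structure: only the model behind each agent role is swapped.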

Paper Structure

This paper contains 32 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of our X-MAS-Bench and X-MAS-Design. X-MAS-Bench assesses the capabilities of LLMs in MAS, while X-MAS-Design focuses on transitioning a homogeneous MAS to a heterogeneous one, drawing on the observations from X-MAS-Bench. Experiments on chatbot-only and mixed chatbot-reasoner scenarios clearly show the benefits of heterogeneous MAS.
  • Figure 2: Benchmarking chatbot LLMs on 5 MAS-related functions and 5 domains. We see that no single LLM excels across all scenarios, indicating the potential advantages of employing heterogeneous LLMs in MAS. All evaluation results will be open-sourced for future research.
  • Figure 3: Diversity for the win. Experiments are conducted with X-MAS-Proto on three domains. Increasing the number of candidate models generally enhances the system performance, strongly indicating the benefits of LLM heterogeneity for MAS.
  • Figure 4: Comparing X-MAS with LLM selection guided by X-MAS-Bench against arbitrary selection. X-MAS-Design, which is guided by X-MAS-Bench, performs best by a significant margin.
  • Figure 5: Benchmarking LLMs on 5 MAS-related functions and 5 domains.