X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs
Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, Siheng Chen
TL;DR
The paper investigates heterogeneous LLM-driven multi-agent systems (X-MAS), in which different agents are powered by different LLMs, to overcome the ceiling imposed when a single model drives every agent. It introduces X-MAS-Bench, a benchmark that evaluates 27 LLMs across 5 MAS-related functions and 5 domains (21 test sets), totaling over 1.7 million evaluations. The study shows that no single LLM excels across all scenarios and that heterogeneous configurations can markedly boost performance, with gains of up to 8.4% on MATH in a chatbot-only setting and up to 47% on AIME in a mixed chatbot-reasoner setting. Building on these insights, X-MAS-Design demonstrates that transitioning from homogeneous to heterogeneous LLM-driven MAS yields consistent improvements across multiple MAS methods and domains without structural redesign, highlighting the value of model diversity for scalable collaborative AI.
Abstract
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of those LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields a performance improvement of up to 8.4% on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS achieves a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
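The selection step described in the abstract, choosing a model for each domain-function combination from benchmark scores, can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout, the function names `select_best_models` and `best_single_model`, the model names, the domain and function labels, and all scores are hypothetical placeholders chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical benchmark records: (model, domain, function, accuracy).
# Every name and number here is a made-up placeholder, not a result from the paper.
RECORDS = [
    ("model_a", "math", "answer", 0.62),
    ("model_b", "math", "answer", 0.71),
    ("model_a", "math", "revise", 0.58),
    ("model_b", "math", "revise", 0.49),
    ("model_a", "coding", "answer", 0.66),
    ("model_b", "coding", "answer", 0.60),
]


def select_best_models(records):
    """Heterogeneous choice: keep the top-scoring model for each (domain, function) pair."""
    best = {}
    for model, domain, function, score in records:
        key = (domain, function)
        if key not in best or score > best[key][1]:
            best[key] = (model, score)
    return best


def best_single_model(records):
    """Homogeneous baseline: the one model with the highest mean score over all records."""
    scores = defaultdict(list)
    for model, _domain, _function, score in records:
        scores[model].append(score)
    return max(scores, key=lambda m: sum(scores[m]) / len(scores[m]))


if __name__ == "__main__":
    print("Best single model:", best_single_model(RECORDS))
    for (domain, function), (model, score) in sorted(select_best_models(RECORDS).items()):
        print(f"{domain}/{function}: {model} ({score:.2f})")
```

In this toy data, the model with the best average is not the best choice for every pair, which is the situation where assigning different models to different agents can help without changing the structure of the MAS.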
