Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin

TL;DR

The paper questions whether cross-model diversity in Mixture-of-Agents (MoA) genuinely improves LLM inference and proposes Self-MoA, a single-model ensemble using repeated sampling to generate diverse outputs. Across benchmarks like AlpacaEval 2.0, MT-Bench, and domain-specific tasks (MMLU, CRUX, MATH), Self-MoA frequently outperforms Mixed-MoA, highlighting that in-model diversity can better navigate the quality-diversity trade-off. It provides a quantitative analysis showing proposer quality has a stronger influence on MoA performance than diversity and introduces Self-MoA-Seq to scale computation under context-length limits. The work also maps scenarios where cross-model mixing may help, offering practical guidance for scalable, high-quality ensemble strategies in LLM inference.

Abstract

Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves $6.6\%$ improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of $3.8\%$ improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that MoA performance is rather sensitive to proposer quality, and that mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA that can aggregate a large number of LLM outputs on the fly over multiple rounds and is as effective as aggregating all outputs at once.
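
As a rough illustration of the two procedures named in the abstract, the sketch below writes Self-MoA as "sample one model several times, then aggregate once", and the sequential variant as a sliding window that carries the best synthesis so far into the next round (following the Figure 1 caption). The callables `sample` and `aggregate`, and the parameters `num_samples`, `window_size`, and `repeats`, are hypothetical placeholders introduced here for illustration; the paper's actual prompts, sampling settings, and window sizes are not given in this summary.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Sampler(prompt)                -> one stochastic response from the single proposer model
#   Aggregator(prompt, candidates) -> one response synthesized from the candidates
Sampler = Callable[[str], str]
Aggregator = Callable[[str, List[str]], str]


def self_moa(prompt: str, sample: Sampler, aggregate: Aggregator,
             num_samples: int = 6) -> str:
    """Sample the same model repeatedly, then aggregate all outputs at once."""
    candidates = [sample(prompt) for _ in range(num_samples)]
    return aggregate(prompt, candidates)


def self_moa_seq(prompt: str, sample: Sampler, aggregate: Aggregator,
                 num_samples: int = 12, window_size: int = 6,
                 repeats: int = 1) -> str:
    """Sliding-window variant: each round sees at most `window_size` candidates.

    After the first round, `repeats` slots of the window hold copies of the
    best synthesis so far (to bias the aggregator towards it) and the
    remaining slots hold fresh samples. The value of `repeats` is an
    assumption; the summary does not state how often the output is repeated.
    """
    if not 0 < repeats < window_size:
        raise ValueError("repeats must be between 1 and window_size - 1")

    candidates = [sample(prompt) for _ in range(num_samples)]

    # Round 1: aggregate the first window of fresh samples.
    best = aggregate(prompt, candidates[:window_size])

    # Later rounds: carry `best` forward alongside the next fresh samples.
    step = window_size - repeats
    for start in range(window_size, num_samples, step):
        fresh = candidates[start:start + step]
        best = aggregate(prompt, [best] * repeats + fresh)
    return best
```

In practice `sample` would wrap a call to one LLM at a nonzero temperature and `aggregate` would wrap an aggregator prompt. The point of the sequential form is that no single aggregation call ever receives more than `window_size` candidates, which is what keeps the context length bounded.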

Paper Structure

This paper contains 33 sections, 5 equations, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Comparison of MoA, Self-MoA, and Self-MoA-Seq. (a) In MoA, multiple models respond to a query, followed by an aggregator synthesizing their outputs. (b) Self-MoA simplifies this by repeatedly sampling from a single model. (c) Self-MoA-Seq extends Self-MoA by applying a sliding window to combine the best output so far with candidate outputs. At each timestep, the synthesized output is repeated to bias the aggregator towards it, reducing the context length requirements and expanding the method's applicability. Note that MoA can extend to multiple rounds of aggregation (see the appendix on multi-layer MoA), while Self-MoA and Self-MoA-Seq can extend to more outputs, but we omit them here for clarity.
  • Figure 2: The diversity-quality trade-off: Mixed-MoA incorporates different individual models as proposers, while Self-MoA uses the same individual model for this role. Quality is assessed based on the average performance of each proposer, and diversity is computed with the Vendi Score (Friedman & Dieng, 2023) of outputs generated by proposers on the same prompts.
  • Figure 3: The performance of Self-MoA and Self-MoA-Seq with a growing number of samples. Dashed lines indicate the performance of a single forward pass with the base model.
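
For reference, the Vendi Score used as the diversity measure in the Figure 2 caption has the following standard definition. Given $n$ outputs $x_1, \dots, x_n$ and a positive semidefinite similarity function $k$ with $k(x, x) = 1$, it is the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix. The symbols $k$, $K$, and $\lambda_i$ are notation introduced here for illustration; which similarity function or embedding the paper uses is not stated in this summary.

```latex
K_{ij} = k(x_i, x_j), \qquad
\lambda_1, \dots, \lambda_n \ \text{the eigenvalues of } K / n, \qquad
\mathrm{VS}(x_1, \dots, x_n) = \exp\!\Big( -\sum_{i=1}^{n} \lambda_i \log \lambda_i \Big),
\quad \text{with the convention } 0 \log 0 = 0 .
```

The score ranges from $1$ (all outputs identical under $k$) to $n$ (all outputs mutually dissimilar), so it can be read as an effective number of distinct outputs.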