Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information

Rui Ai, Yuqi Pan, David Simchi-Levi, Milind Tambe, Haifeng Xu

TL;DR

This work tackles aggregation of multiple LLM answers beyond simple majority voting (MV) by exploiting higher-order information. It introduces Optimal Weight (OW), a Bayesian-optimal first-order aggregator whose weights are derived from per-model accuracies, and Inverse Surprising Popularity (ISP), a second-order rule that leverages cross-model prediction correlations to outperform MV without ground-truth labels. The authors prove theoretical advantages for OW and ISP, compare them to the Surprisingly Popular (SP) algorithm, and provide practical estimation methods (OW-L, OW-I) for unsupervised settings. Experiments on synthetic data and real-world tasks (UltraFeedback, MMLU, ARMMAN) show that ISP and OW-based methods outperform MV in most ensembles, with ISP offering robust gains when $K$ is small and when per-question difficulty complicates independence. Extensions to settings with dependent questions show that ISP remains effective there as well.

Abstract

With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting, ARMMAN. Across all cases, our methods consistently outperform majority voting, offering both practical performance gains and conceptual insights for the design of robust multi-agent LLM pipelines.
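
To make the contrast with majority voting concrete, here is a minimal sketch of accuracy-weighted voting. It is not the paper's OW algorithm, whose exact weight formula is not reproduced in this summary. It is the classical log-odds weighting, which gives the maximum a posteriori answer under three assumptions made here for illustration: models err independently given the true answer, wrong answers are spread uniformly over the remaining options, and the prior over options is uniform. The function names and the example accuracies are hypothetical.

```python
import math
from collections import defaultdict


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def log_odds_weights(accuracies, num_options):
    """Log-odds weight log(p * (L - 1) / (1 - p)) for each model.

    Assumes (for illustration only) independent models whose errors are
    spread uniformly over the L - 1 wrong options.
    """
    eps = 1e-6  # clip accuracies away from 0 and 1 to keep the log finite
    weights = []
    for p in accuracies:
        p = min(max(p, eps), 1.0 - eps)
        weights.append(math.log(p * (num_options - 1) / (1.0 - p)))
    return weights


def weighted_vote(answers, weights):
    """Return the answer with the largest total weight."""
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # One strong model disagrees with two weaker ones on a binary question.
    answers = ["A", "B", "B"]
    weights = log_odds_weights([0.9, 0.6, 0.6], num_options=2)
    print(majority_vote(answers))           # the two weaker models win
    print(weighted_vote(answers, weights))  # the strong model's vote dominates
```

In this toy case the single 0.9-accuracy model carries weight log 9, which exceeds the combined weight 2 log 1.5 of the two 0.6-accuracy models, so the weighted rule overturns the majority. In practice the accuracies are unknown without labels, which is the gap the estimation methods mentioned in the TL;DR (OW-L, OW-I) are meant to address.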

Paper Structure

This paper contains 40 sections, 10 theorems, 82 equations, 2 figures, 6 tables, and 3 algorithms.

Key Result

Proposition 1

The joint distribution $\mathbb{P}$ satisfies the following properties:

Figures (2)

  • Figure 1: The performance gap between ISP and MV vanishes as $K$ increases.
  • Figure 2: Accuracy of different LLMs across the three datasets.

Theorems & Definitions (15)

  • Proposition 1
  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Proposition 2
  • Example 1
  • Theorem 2
  • Theorem 3
  • Example 2
  • Proof of \ref{example:pop}
  • ...and 5 more