When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines

Artem Maryanskyy

Abstract

Multi-agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents (MoA) teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck: a crossover threshold in aggregation quality that determines whether diversity helps or hurts. We formalize this in a selection quality model and obtain a closed-form crossover threshold $s^*$ (Proposition 1) that separates the two regimes. In a targeted experiment spanning 42 tasks across 7 categories ($N=210$), a diverse team with judge-based selection achieves a win rate of 0.810 against a single-model baseline, while a homogeneous team scores 0.512, which is near chance (Glass's $\Delta = 2.07$). Judge-based selection outperforms MoA-style synthesis by $\Delta_{\mathrm{WR}} = +0.631$; the judge panel prefers the synthesis approach over the baseline in zero of 42 tasks. A decoupled evaluation with independent judges confirms all directional findings (Spearman $\rho = 0.90$). Exploratory evidence suggests that including a weaker model improves performance while reducing cost ($p < 10^{-4}$, not pre-registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single-round generate-then-select pipelines.

Paper Structure (38 sections, 1 theorem, 6 equations, 3 figures, 5 tables)

Introduction
Related Work
Multi-Agent Debate and Frameworks
Mixture-of-Agents and Self-MoA
Selection and Routing
Diversity Theory
LLM-as-Judge
The Gap This Paper Fills
Theoretical Framework
Setup and Notation
The Selection Quality Model
The Selection Bottleneck
Optimal Team Size
Connection to Classical Results
Experimental Setup
...and 23 more sections

Key Result

Proposition 1

Suppose Assumptions ass:linear and ass:homogeneous hold. Let $T_h$ be a homogeneous team with mean quality $\mu_{\mathrm{best}}$, and let $T_d$ be a diverse team satisfying [...]. Then there exists a unique $s^* \in (0,1)$ such that [...], where [...].
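To make the idea of a crossover threshold concrete, the following is a minimal Python sketch of a toy model. It is not the paper's Proposition 1, whose displayed conditions are not reproduced here. The sketch assumes, purely for illustration, that a selector of quality $s \in [0,1]$ interpolates linearly between a random pick (the team's mean quality) and an oracle pick (the team's best candidate), and that the homogeneous team delivers a constant quality $\mu_{\mathrm{best}}$. The function names `output_quality` and `crossover_threshold` and all numeric values are hypothetical.

```python
def output_quality(mean_q: float, best_q: float, s: float) -> float:
    """Toy linear model (an assumption, not the paper's formula).

    s = 0 behaves like a random pick (team mean);
    s = 1 behaves like an oracle pick (team best).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("selector quality s must lie in [0, 1]")
    return (1.0 - s) * mean_q + s * best_q


def crossover_threshold(mu_best: float, mean_d: float, best_d: float) -> float:
    """Solve (1 - s) * mean_d + s * best_d = mu_best for s.

    Requires mean_d < mu_best < best_d, so that the diverse team is worse
    on average but contains a better best candidate; only then does the
    solution fall strictly inside (0, 1).
    """
    if not (mean_d < mu_best < best_d):
        raise ValueError("need mean_d < mu_best < best_d for a crossover in (0, 1)")
    return (mu_best - mean_d) / (best_d - mean_d)


# Hypothetical numbers chosen only to illustrate the shape of the curve.
MU_BEST = 0.70              # constant quality of the homogeneous team
MEAN_D, BEST_D = 0.60, 0.85  # diverse team: lower mean, higher best

s_star = crossover_threshold(MU_BEST, MEAN_D, BEST_D)
print(f"toy crossover threshold: {s_star:.3f}")
for s in (0.2, s_star, 0.8):
    gap = output_quality(MEAN_D, BEST_D, s) - MU_BEST
    print(f"s = {s:.2f}: diverse minus homogeneous = {gap:+.3f}")
```

In this toy setting a weak selector (small $s$) cannot exploit the diverse team's stronger candidates, so the lower team mean dominates; a strong selector reverses the ordering. The paper's actual threshold may depend on further quantities not modeled here.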

Figures (3)

Figure 1: The selection bottleneck. (a) Theoretical output quality $Q(T,s)$ as a function of selector quality $s$. The diverse team (blue) crosses the homogeneous baseline (gray) at the crossover threshold $s^*$ (dashed vertical). Below $s^*$, diversity hurts; above it, diversity helps. (b) Empirical operating regimes from V4 data. Judge-based selection operates well above $s^*$ (WR = 0.810), majority vote sits near $s^*$ (WR = 0.496), and MoA-style synthesis falls far below it (WR = 0.179). The homogeneous baseline (WR = 0.512, dotted) is shown for reference.
Figure 2: BT-corrected consensus win rates for all five experimental cells, averaged over 42 tasks. High performance appears only when diversity and judge-based selection are combined. The synthesis cell (MoA) performs worst, falling well below the single-model baseline.
Figure 3: Per-task diversity advantage ($\Delta$ WR = diverse_strong+judge minus homo_opus+judge) across all 42 tasks. Positive values favor diversity. 38 of 42 tasks show a positive effect (4 ties, 0 negative; Clopper--Pearson 95% CI: [0.774, 0.973]). Error bars are approximate 95% CIs.
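The Clopper--Pearson interval quoted in the Figure 3 caption can be checked with a short standard-library sketch. The code below assumes the interval is computed on 38 positive tasks out of 42, with the 4 ties counted as non-positive; that reading is inferred from the caption rather than stated in it. The helper names `binom_cdf` and `clopper_pearson` are illustrative, and the bounds are found by bisection on the binomial tail probabilities rather than by a beta-quantile routine.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) two-sided confidence interval for a proportion."""

    def solve(f, target: float) -> float:
        # Bisection for f(p) = target, where f is increasing in p on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2.0
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X > k) equals 1 - alpha / 2.
    upper = 1.0 if k == n else solve(
        lambda p: 1.0 - binom_cdf(k, n, p), 1.0 - alpha / 2.0
    )
    return lower, upper


# 38 of 42 tasks show a positive effect (ties assumed to count as non-positive).
ci_low, ci_high = clopper_pearson(38, 42)
print(f"point estimate: {38 / 42:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Under these assumptions the result should land close to the reported [0.774, 0.973]. Small differences in the third decimal would point to a different treatment of ties or a different rounding convention.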

Theorems & Definitions (6)

Proposition 1: Crossover Threshold
proof
Remark 1: Nonlinear Generalization
Remark 2: Model Quality Irrelevance
Remark 3
Remark 4: Team Expansion
