When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines

Artem Maryanskyy

Abstract

Multi-agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck -- a crossover threshold in aggregation quality that determines whether diversity helps or hurts. Under this model, we obtain a closed-form crossover threshold $s^*$ (Proposition 1) that separates the regimes where diversity helps and hurts. In a targeted experiment spanning 42 tasks across 7 categories ($N=210$), a diverse team with judge-based selection achieves a win rate of 0.810 against a single-model baseline, while a homogeneous team scores 0.512 -- near chance (Glass's $\Delta = 2.07$). Judge-based selection outperforms MoA-style synthesis by $\Delta_{\mathrm{WR}} = +0.631$ -- the synthesis approach is preferred over the baseline in zero of 42 tasks by the judge panel. A decoupled evaluation with independent judges confirms all directional findings (Spearman $\rho = 0.90$). Exploratory evidence suggests that including a weaker model improves performance while reducing cost ($p < 10^{-4}$, not pre-registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single-round generate-then-select pipelines.

Paper Structure (38 sections, 1 theorem, 6 equations, 3 figures, 5 tables)

Key Result

Proposition 1

Suppose Assumptions ass:linear and ass:homogeneous hold. Let $T_h$ be a homogeneous team with mean quality $\mu_{\mathrm{best}}$, and let $T_d$ be a diverse team satisfying [display equation missing]. Then there exists a unique $s^* \in (0,1)$ such that [display equation missing], where [display equation missing].
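
The idea of a unique crossover in $(0,1)$ can be illustrated with a toy model. The sketch below is not the paper's formula: it assumes, purely for illustration, that a homogeneous team delivers a flat quality `mu_best` regardless of the selector, and that a diverse team's quality interpolates linearly between a hypothetical fallback level `q_fallback` (at $s=0$) and a hypothetical best-candidate level `q_top` (at $s=1$). The function names and all numbers are assumptions of this sketch.

```python
def diverse_quality(s: float, q_top: float, q_fallback: float) -> float:
    """Toy linear model (an assumption of this sketch, not the paper's equation):
    quality moves from q_fallback at s=0 to q_top at s=1."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return s * q_top + (1.0 - s) * q_fallback


def crossover_threshold(mu_best: float, q_top: float, q_fallback: float) -> float:
    """Selector quality at which the toy diverse team matches the flat baseline.

    Solves s * q_top + (1 - s) * q_fallback = mu_best for s.
    A solution strictly inside (0, 1) requires q_fallback < mu_best < q_top.
    """
    if not q_fallback < mu_best < q_top:
        raise ValueError("need q_fallback < mu_best < q_top for a crossover in (0, 1)")
    return (mu_best - q_fallback) / (q_top - q_fallback)


# Hypothetical quality levels, chosen only to make the arithmetic easy to follow.
mu_best, q_top, q_fallback = 0.70, 0.80, 0.55
s_star = crossover_threshold(mu_best, q_top, q_fallback)
print(f"s* = {s_star:.2f}")

for s in (0.3, s_star, 0.9):
    q = diverse_quality(s, q_top, q_fallback)
    print(f"s = {s:.2f}: diverse = {q:.3f}, homogeneous = {mu_best:.3f}")
```

With these made-up numbers the threshold is $s^* = 0.15 / 0.25 = 0.6$: at $s = 0.3$ the toy diverse team falls below the baseline, and at $s = 0.9$ it exceeds it. The qualitative picture (diversity hurts below the threshold and helps above it) is what the proposition describes; the specific functional form here is only a stand-in.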

Figures (3)

  • Figure 1: The selection bottleneck. (a) Theoretical output quality $Q(T,s)$ as a function of selector quality $s$. The diverse team (blue) crosses the homogeneous baseline (gray) at the crossover threshold $s^*$ (dashed vertical). Below $s^*$, diversity hurts; above it, diversity helps. (b) Empirical operating regimes from V4 data. Judge-based selection operates well above $s^*$ (WR = 0.810), majority vote sits near $s^*$ (WR = 0.496), and MoA-style synthesis falls far below it (WR = 0.179). The homogeneous baseline (WR = 0.512, dotted) is shown for reference.
  • Figure 2: BT-corrected consensus win rates for all five experimental cells, averaged over 42 tasks. High performance appears only when diversity and judge-based selection are combined. The synthesis cell (MoA) performs worst, falling well below the single-model baseline.
  • Figure 3: Per-task diversity advantage ($\Delta$ WR = diverse_strong+judge minus homo_opus+judge) across all 42 tasks. Positive values favor diversity. 38 of 42 tasks show a positive effect (4 ties, 0 negative; Clopper--Pearson 95% CI: [0.774, 0.973]). Error bars are approximate 95% CIs.
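
The Clopper--Pearson interval quoted in the Figure 3 caption can be checked with a short standard-library sketch. It assumes the interval is for the proportion 38/42 (tasks with a positive effect out of all tasks), which the caption implies but does not state outright. The helper names are hypothetical, and the bounds are found by bisection on the exact binomial tail probabilities rather than through a statistics library.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1))


def solve_increasing(f, target: float, iters: int = 200) -> float:
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2.0
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X > k) equals 1 - alpha / 2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1.0 - binom_cdf(k, n, p), 1.0 - alpha / 2.0
    )
    return lower, upper


lower, upper = clopper_pearson(38, 42)
print(f"38/42 = {38 / 42:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```

Under that assumption the computed bounds agree with the caption's [0.774, 0.973] to three decimals. Treating the four ties as non-positive outcomes is what makes the denominator 42 rather than 38.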

Theorems & Definitions (6)

  • Proposition 1: Crossover Threshold
  • proof
  • Remark 1: Nonlinear Generalization
  • Remark 2: Model Quality Irrelevance
  • Remark 3
  • Remark 4: Team Expansion