Impacts of Aggregation on Model Diversity and Consumer Utility

Kate Donahue, Manish Raghavan

TL;DR

This work proposes a new mechanism, weighted winrate, which rewards models for answers that are higher quality, and shows that it provably improves incentives for producers to specialize and increases consumer welfare.

Abstract

Consider a marketplace of AI tools, each with slightly different strengths and weaknesses. By selecting the right model for the task at hand, a user can do better than simply committing to a single model for everything. Routers operate under a similar principle, where sophisticated model selection can increase overall performance. However, aggregation is often noisy, reflecting imperfect user choices or routing decisions. This leads to two main questions. First, what does a "healthy marketplace" of models look like for maximizing consumer utility? Second, how can we incentivize producers to create such models? Here, we study two types of model changes: market entry (where an entirely new model is created and added to the set of available models), and model replacement (where an existing model has its strengths and weaknesses changed). We show that winrate, a standard benchmark in LLM evaluation, can incentivize model creators to homogenize for both types of model changes, reducing consumer welfare. We propose a new mechanism, weighted winrate, which rewards models for answers that are higher quality, and show that it provably improves incentives for producers to specialize and increases consumer welfare. We conclude by demonstrating that our theoretical results generalize to empirical benchmark datasets and discussing implications for evaluation design.

Paper Structure

This paper contains 32 sections, 47 theorems, 96 equations, 10 figures, and 2 tables.

Key Result

Lemma 1

Adding a model $B$ to a model $A$ increases consumer welfare with BTL aggregation if and only if:

Figures (10)

  • Figure 1: Plot of Equation \ref{eq:deltaone} for a single task, with $\beta=1$: equivalent to $\frac{\exp(\Delta/\beta)}{\exp(\Delta/\beta) + 1} \cdot \Delta$, for $\Delta = v_{b} - v_{a}$. Note that in the example in Lemma \ref{lem:exnonmonotone}, as $\Delta = v_b - v_a$ goes from $-4$ to $-1$, the curve becomes more negative, meaning that it reduces consumer welfare.
  • Figure 2: Figure demonstrating the bound in Lemma \ref{lem:sufficientmonotone}, where there is a set of values $[1, 3, 5, v]$, with $v$ varying along the $x$ axis and $\beta$ varying along the $y$ axis. Blue regions denote settings where BTL aggregation is in the monotone region with respect to $v$, and red regions denote settings where it is not. The black line gives the bound in Lemma \ref{lem:sufficientmonotone}: above this line BTL aggregation is guaranteed to be in the monotone region with respect to $v$.
  • Figure 3: Average value of combinations of models (restricted to top 10 models for computational reasons).
  • Figure 4: Fraction of derivatives of $v(S)$ that are negative.
  • Figure 5: Change in consumer welfare given the allocations in Table \ref{tab:max_alloc}. The prior aggregate value of the existing marketplace is shown in gray, the change in value induced by winrate in red, and the change in value induced by consumer welfare or weighted winrate incentives in blue. Overlapping regions appear in purple. Note that even though the same total amount is allocated across tasks, the total improvement under the weighted winrate/consumer welfare objective is larger. The only task where winrate outperforms weighted winrate/consumer welfare is CC, where the weighted winrate/consumer welfare best responses abstain or return 0 value.
  • ...and 5 more figures
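
The single-task expression in the Figure 1 caption can be checked numerically. The sketch below is illustrative and not the authors' code. It assumes that BTL aggregation selects model $i$ with probability proportional to $\exp(v_i/\beta)$, a form inferred from the caption rather than quoted from the paper's Definition 3. Under that assumption, adding $B$ to a marketplace containing only $A$ changes expected value by $\frac{\exp(\Delta/\beta)}{\exp(\Delta/\beta) + 1} \cdot \Delta$. The function names `btl_welfare` and `delta_from_adding` are hypothetical, and $v_a$ is normalized to 0 for the demonstration.

```python
import math


def btl_welfare(values, beta=1.0):
    """Expected value when model i is picked with probability
    proportional to exp(v_i / beta) (assumed softmax form of BTL)."""
    shift = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - shift) / beta) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def delta_from_adding(v_a, v_b, beta=1.0):
    """Closed form from the Figure 1 caption:
    exp(d/beta) / (exp(d/beta) + 1) * d, with d = v_b - v_a.
    Intended for moderate |d / beta|; math.exp can overflow otherwise."""
    d = v_b - v_a
    return d / (1.0 + math.exp(-d / beta))


if __name__ == "__main__":
    v_a = 0.0  # normalization chosen for illustration only
    for v_b in (-4.0, -1.0, 2.0):
        direct = btl_welfare([v_a, v_b]) - btl_welfare([v_a])
        closed = delta_from_adding(v_a, v_b)
        print(f"delta={v_b - v_a:+.1f}  direct={direct:+.4f}  closed={closed:+.4f}")
```

With $\beta = 1$, the change is roughly $-0.07$ at $\Delta = -4$ and roughly $-0.27$ at $\Delta = -1$, matching the caption's observation that improving a weak model from $\Delta = -4$ to $\Delta = -1$ makes the curve more negative. The same `btl_welfare` helper accepts longer value lists such as `[1, 3, 5, v]`, the setting described for Figure 2.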

Theorems & Definitions (89)

  • Definition 1: Random
  • Definition 2: Optimal
  • Definition 3: Bradley-Terry-Luce
  • Definition 4: Winrate
  • Definition 5: Weighted winrate (ours)
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Definition 6: Monotone utility
  • Lemma 4
  • ...and 79 more