Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Guanhua Zhang, Moritz Hardt

TL;DR

This work reframes multi-task benchmarks as social choice problems in which tasks vote on models, revealing a fundamental trade-off between diversity of task rankings and stability to irrelevant changes. The authors introduce two measures: diversity, computed via reversed Kendall's $W$, and sensitivity, computed via Kendall's $\tau$ and the maximum normalized rank change (MRC). They develop efficient approximation algorithms to compute both for cardinal and ordinal benchmarks. Experiments on seven cardinal and eleven ordinal benchmarks show that increasing diversity worsens stability to trivial task changes, while excessive robustness reduces diversity, implying that no one-size-fits-all benchmark design exists. The study also shows that many popular benchmarks are significantly sensitive to irrelevant changes, highlighting the need for careful benchmark construction and reporting. Code and data are provided for replication.

Abstract

We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: the more diverse a multi-task benchmark, the more sensitive it is to trivial changes. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The code and data are available at https://socialfoundations.github.io/benchbench/.
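
The distinction between cardinal and ordinal aggregation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the score table, the model and task names, the use of the mean score as the cardinal rule, and the use of a Borda count as the ordinal rule. It is not taken from the paper's code.

```python
from statistics import mean

# Hypothetical scores[task][model]; higher is better.
scores = {
    "task_a": {"m1": 0.99, "m2": 0.40, "m3": 0.30},
    "task_b": {"m1": 0.50, "m2": 0.55, "m3": 0.45},
    "task_c": {"m1": 0.50, "m2": 0.55, "m3": 0.45},
}


def cardinal_ranking(scores):
    """Cardinal rule (assumed): rank models by mean score, best first."""
    models = sorted(next(iter(scores.values())))
    avg = {m: mean(task[m] for task in scores.values()) for m in models}
    return sorted(models, key=lambda m: -avg[m])


def ordinal_ranking(scores):
    """Ordinal rule (assumed): Borda count over per-task rankings."""
    models = sorted(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for task in scores.values():
        order = sorted(models, key=lambda m: -task[m])  # best first
        for position, m in enumerate(order):
            # The best model on a task earns len(models) - 1 points.
            points[m] += len(models) - 1 - position
    return sorted(models, key=lambda m: -points[m])


print(cardinal_ranking(scores))  # m1 leads thanks to its large margin on task_a
print(ordinal_ranking(scores))   # m2 leads because it wins two of three tasks
```

In this toy table the two rules disagree on the top model, which is the kind of difference the cardinal/ordinal distinction is meant to capture.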

Paper Structure

This paper contains 27 sections, 4 theorems, 9 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.1

No ordinal benchmark $f^{\text{o}}$ can fulfill the following conditions simultaneously:
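
The list of conditions is not reproduced in this summary. The sensitivity to irrelevant models that the abstract attributes to ordinal benchmarks can still be illustrated with a small sketch. The Borda count, the five task rankings, and the model names `A`, `B`, `C` are all assumptions chosen for illustration; the paper may use different aggregation rules.

```python
def borda_ranking(task_rankings, models):
    """Aggregate per-task rankings (best first), restricted to `models`.

    Borda count is an assumed ordinal rule, used here only as an example.
    """
    points = {m: 0 for m in models}
    for ranking in task_rankings:
        kept = [m for m in ranking if m in points]
        for position, m in enumerate(kept):
            points[m] += len(kept) - 1 - position
    # Sort by points (descending), breaking ties by name for determinism.
    return sorted(points, key=lambda m: (-points[m], m))


# Hypothetical benchmark: three tasks prefer A > B > C, two prefer B > C > A.
task_rankings = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2

before = borda_ranking(task_rankings, ["A", "B"])
after = borda_ranking(task_rankings, ["A", "B", "C"])

print(before)  # ranking with only A and B
print(after)   # ranking once C is added
```

No task changes its opinion about `A` versus `B`, yet adding `C` flips their relative order in the aggregate: `A` leads 3 points to 2 without `C`, while `B` leads 7 points to 6 once `C` is included.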

Figures (7)

  • Figure 1: Ranking changes after irrelevant changes to tasks. For the cardinal benchmark OpenLLM (left), Before refers to the original ranking, and After is the new ranking after injecting label noise into different tasks. For the ordinal benchmark HELM-accuracy (right), Before refers to the ranking based on only the original top-20% models, while After is the new relative ranking after adding irrelevant models from the remaining 80%. The $y$-axis shows the rank.
  • Figure 2: Trade-off between benchmark diversity and sensitivity to irrelevant changes. Left: Cardinal benchmarks. Right: Ordinal benchmarks. Sensitivity is measured in terms of the maximum normalized rank change (MRC) possible via irrelevant task changes. Diversity is measured by Kendall's coefficient of concordance ($W$). The green curve is a linear regression on all points, fitted without an intercept.
  • Figure 3: The $x$-axis indicates the diversity of model rankings across tasks, evaluated by Kendall's $W$ coefficient. The $y$-axis represents the sensitivity of the final model ranking to different portions of label noise across tasks. The ranking change is measured by both Kendall's $\tau$ (top) and MRC (bottom). The green curve is a linear regression on all points, fitted without an intercept.
  • Figure 4: Sensitivity of cardinal benchmarks as a function of the minimal preserving ratio $\epsilon$. The $x$-axis shows $\epsilon$, the minimum portion of examples left unchanged, as stated in Equation \ref{eq:cardinal_obj}. The $y$-axis shows sensitivity measured by Kendall's $\tau$ (top) and MRC (bottom).
  • Figure 5: The $x$-axis indicates the diversity of model rankings across tasks, evaluated by the reversed Kendall's $W$ coefficient, where $W=0$ denotes uniformity in rankings, while $W=1$ means random or highly varied rankings across tasks. The $y$-axis represents the sensitivity to the addition of irrelevant candidate models, measured by Kendall's $\tau$ (top) and MRC (bottom). The green curve is a linear regression on all points, fitted without an intercept.
  • ...and 2 more figures
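
The quantities named in the captions above can be sketched in a few lines of Python. The sketch makes several assumptions: rankings contain no ties, reversed Kendall's $W$ is taken to be $1 - W$ (consistent with the Figure 5 caption), and MRC is taken to be the largest absolute rank change of any model divided by the number of models. The paper's exact definitions and normalization may differ, and the function names are hypothetical.

```python
from itertools import combinations


def kendalls_w(rank_matrix):
    """Kendall's coefficient of concordance, assuming no ties.

    rank_matrix[t][i] is the rank (1 = best) of model i on task t.
    """
    m, n = len(rank_matrix), len(rank_matrix[0])
    rank_sums = [sum(row[i] for row in rank_matrix) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def reversed_w(rank_matrix):
    """Diversity as 1 - W (assumed reading of 'reversed Kendall's W')."""
    return 1 - kendalls_w(rank_matrix)


def kendalls_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same models, no ties."""
    n = len(rank_a)
    total = 0
    for i, j in combinations(range(n), 2):
        agree = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0
        total += 1 if agree else -1
    return total / (n * (n - 1) / 2)


def max_rank_change(rank_a, rank_b):
    """Largest absolute rank change, divided by the number of models
    (assumed normalization)."""
    n = len(rank_a)
    return max(abs(a - b) for a, b in zip(rank_a, rank_b)) / n


agreeing = [[1, 2, 3], [1, 2, 3]]      # tasks rank models identically
disagreeing = [[1, 2, 3], [3, 2, 1]]   # tasks rank models in opposite order
print(reversed_w(agreeing), reversed_w(disagreeing))
print(kendalls_tau([1, 2, 3], [1, 3, 2]), max_rank_change([1, 2, 3], [1, 3, 2]))
```

With these definitions, identical task rankings give a diversity of 0 and exactly opposed rankings give a diversity of 1, matching the orientation described for the reversed coefficient.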

Theorems & Definitions (7)

  • Theorem 3.1: Arrow's Impossibility Theorem for Benchmarks
  • Theorem B.1: Arrow's Impossibility Theorem for Benchmarks
  • Lemma B.2: Field Expansion Lemma
  • proof
  • Lemma B.3: Group Contraction Lemma
  • proof
  • proof