Beyond Arrow: From Impossibility to Possibilities in Multi-Criteria Benchmarking

Polina Gordienko; Christoph Jansen; Julian Rodemann; Georg Schollmeyer

TL;DR

This paper reframes multi-criteria benchmarking as a social-choice problem in which each metric induces a model ranking on each dataset. It shows Arrow's impossibility can be avoided when input rankings obey domain restrictions such as single-peaked, group-separable, or distance-restricted preferences, enabling coherent and stable aggregations. Under these conditions, pairwise-majority benchmarking $B_M$ yields complete, transitive rankings and satisfies non-dictatorship, weak Pareto, and independence of irrelevant alternatives, with cross-dataset aggregation guided by depth-based centrality (generalized Tukey depth) and a commonality-sharing ranking. Empirically, using HELM MMLU and other benchmarks, the authors show that the presence of the structural properties depends on the metrics and model sets, with frequent Condorcet cycles when structure is lacking and a viable central ranking when it holds. These results provide a principled route to robust, interpretable multi-criteria benchmarking and motivate future work on broader benchmarks and potential strategic behavior in evaluation.

Abstract

Modern benchmarks such as HELM MMLU account for multiple metrics, such as accuracy, robustness, and efficiency. When these metrics are combined into a single ranking, natural aggregation procedures can become incoherent or unstable under changes to the model set. We formalize this aggregation as a social choice problem in which each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility result, we argue that the impossibility often originates from pathological examples, and we identify sufficient conditions under which these disappear and meaningful multi-criteria benchmarking becomes possible. In particular, we consider three restrictions on the combinations of rankings and prove that on single-peaked, group-separable, and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites, such as HELM MMLU, and verify which structural conditions are fulfilled on which benchmark problems.
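To make the aggregation step concrete, the following is a minimal sketch of pairwise-majority aggregation over metric-induced rankings, together with a transitivity check that exposes Condorcet cycles. It is an illustration under stated assumptions and not the authors' implementation: the function names `pairwise_majority` and `is_transitive`, the model labels, and the toy profiles are all hypothetical, and each ranking is assumed to be a strict order listed from best to worst.

```python
from itertools import combinations


def pairwise_majority(rankings):
    """Return the set of pairs (a, b) such that a strict majority of
    rankings places a above b. Each ranking lists models best-first.
    Hypothetical helper; ties (possible with an even number of
    rankings) simply produce no pair in either direction."""
    models = rankings[0]
    positions = [{m: i for i, m in enumerate(r)} for r in rankings]
    beats = set()
    for a, b in combinations(models, 2):
        a_wins = sum(p[a] < p[b] for p in positions)
        b_wins = len(rankings) - a_wins
        if a_wins > b_wins:
            beats.add((a, b))
        elif b_wins > a_wins:
            beats.add((b, a))
    return beats


def is_transitive(beats):
    """Check that a beats b and b beats c always imply a beats c."""
    return all(
        (a, d) in beats
        for (a, b) in beats
        for (c, d) in beats
        if b == c
    )


# Three metrics whose rankings form a classic Condorcet cycle.
cyclic_profile = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

# Three metrics whose rankings are single-peaked along the axis A, B, C.
single_peaked_profile = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]

print(is_transitive(pairwise_majority(cyclic_profile)))
print(is_transitive(pairwise_majority(single_peaked_profile)))
```

In the first toy profile the majority relation cycles (A over B, B over C, C over A), so no coherent ranking exists; in the second, the majority relation is the transitive order B, A, C. An odd number of rankings is used here so that no pairwise ties arise.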

Paper Structure (30 sections, 12 theorems, 5 equations, 3 figures, 2 tables)

Introduction
Related Work
Motivation
Benchmarking as a Social Choice Problem
Notation
Social Choice: From Impossibility to Possibility
Aggregation Across Datasets
Restricted Preference Domains in Benchmarking
Single-Peaked Preferences
Group-Separable Preferences
Distance-Restricted Preferences
Experiments
Experimental Setup
Results
Discussion
...and 15 more sections

Key Result

Theorem 4.1

(Arrow) Suppose $k \geq 3$ and $n \geq 2$. Let $F$ be an operator that maps each profile from $(\mathrm{pref}(\mathcal{A}))^n$ (Universal Domain) to a preference relation in $\mathrm{pref}(\mathcal{A})$ (Social Ordering). For a profile $R = (R_1, \dots, R_n) \in (\mathrm{pref}(\mathcal{A}))^n$, let $P_i$ be the strict pa...

Figures (3)

Figure 1: We fix seven language models and a set of accuracy and efficiency metrics from HELM. For each MMLU subject dataset, each metric induces a ranking of models; we study when these rankings are consistent with a single shared ordering of models (x-axis) so that each metric has one "sweet spot" (one peak) along that ordering. The subject Business Ethics (left) satisfies this structure; Abstract Algebra (right) does not.
Figure 2: Aggregated ranking across all MMLU subjects according to the commonality sharing rule for two domains of single-peaked preferences (left: $(\mathcal{A}_1, \Phi_{acc})$, right: $(\mathcal{A}_3, \Phi_{mix})$).
Figure 3: Aggregated ranking across all $57$ datasets of HELM MMLU according to the commonality sharing rule (cf. the section "Aggregation Across Datasets") for the experiments for group separability (left, $(\mathcal{A}_2, \Phi_{acc})$ and $(\mathcal{A}_3, \Phi_{mix})$) and for distance-restrictedness (right, $(\mathcal{A}_4, \Phi_{acc})$ and $(\mathcal{A}_3, \Phi_{acc})$).
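The single-peakedness condition described in the caption of Figure 1 can be sketched in a few lines. This is a hedged illustration, not the procedure used in the paper: the helper names `is_single_peaked`, `profile_is_single_peaked`, and `find_axis` are hypothetical, rankings are assumed to be strict and listed best-first, and the axis search is a plain brute force over all orderings, which is only practical for small model sets such as the seven models mentioned in the caption.

```python
from itertools import permutations


def is_single_peaked(ranking, axis):
    """Check one ranking against a fixed left-to-right axis of models.
    Walking along the axis, rank positions must improve strictly up to
    the top-ranked model and worsen strictly after it."""
    pos = {m: i for i, m in enumerate(ranking)}  # 0 = best
    ranks = [pos[m] for m in axis]
    peak = ranks.index(0)
    left_ok = all(ranks[i] > ranks[i + 1] for i in range(peak))
    right_ok = all(ranks[i] < ranks[i + 1] for i in range(peak, len(ranks) - 1))
    return left_ok and right_ok


def profile_is_single_peaked(rankings, axis):
    """A profile is single-peaked on an axis if every ranking is."""
    return all(is_single_peaked(r, axis) for r in rankings)


def find_axis(rankings):
    """Brute-force search for a shared axis; returns None if none exists.
    Illustrative only: the cost grows factorially in the number of models."""
    for axis in permutations(rankings[0]):
        if profile_is_single_peaked(rankings, axis):
            return list(axis)
    return None


# Toy profiles with hypothetical model labels.
structured = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

print(find_axis(structured))
print(find_axis(cyclic))
```

For the structured profile a shared axis exists, so every metric has a single peak along it; for the cyclic profile no axis works, because each model is ranked last by some metric, and the model in the middle of any three-model axis can never be ranked last in a single-peaked ranking.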

Theorems & Definitions (23)

Theorem 4.1
Definition 5.1
Theorem 5.2
Definition 5.3
Theorem 5.4
Definition 5.5
Theorem 5.6
Lemma 1.1
proof
Theorem 1.2
...and 13 more
