Re-evaluating Evaluation

David Balduzzi, Karl Tuyls, Julien Perolat, Thore Graepel

TL;DR

This paper addresses the problem of evaluating machine learning agents across increasingly diverse tasks, showing that benchmarks to which tasks and agents can be freely added produce biased results. It introduces Nash averaging, an evaluation method based on the maximum-entropy (maxent) Nash equilibrium that yields invariant, interpretable evaluations by automatically adapting to redundancies in agents and tasks, in both agent-vs-agent (AvA) and agent-vs-task (AvT) settings. The framework uses antisymmetric logit matrices, Schur decompositions, and combinatorial Hodge theory to extend Elo to a multidimensional form (mElo) and to decompose interactions into transitive and cyclic components. Empirical results on Atari show that Nash averaging identifies a core set of agents and environments and suggest that, under a balanced evaluation, human performance is on par with the strongest agents. Overall, the approach promotes inclusive, robust evaluation while acknowledging limitations tied to data quality and practical choices in benchmarking.

Abstract

Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since there is no harm (computational cost aside) from including all available tasks and agents.
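
To make the redundancy claim concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes a hypothetical rock-paper-scissors population in which rock is entered twice, with antisymmetric payoffs `A[i][j] = 1` when agent `i` beats agent `j`. The distribution `p_star` is worked out by hand: the equilibrium condition forces total mass 1/3 on the two rocks, and entropy is maximized by splitting that mass evenly. The code only checks the equilibrium condition `A·p <= 0`; it does not solve for the equilibrium. All names are illustrative.

```python
def mat_vec(A, p):
    """Return the matrix-vector product A·p for a list-of-lists matrix."""
    return [sum(a * w for a, w in zip(row, p)) for row in A]


def is_nash(A, p, tol=1e-12):
    """Check the equilibrium condition for a symmetric zero-sum game.

    For an antisymmetric payoff matrix the game value is 0, so p is an
    equilibrium strategy iff no agent scores above 0 against it.
    """
    return all(v <= tol for v in mat_vec(A, p))


# Hypothetical population: rock, a duplicate of rock, paper, scissors.
A = [
    [0, 0, -1, 1],
    [0, 0, -1, 1],
    [1, 1, 0, -1],
    [-1, -1, 1, 0],
]

uniform = [1 / 4] * 4                     # every agent weighted equally
p_star = [1 / 6, 1 / 6, 1 / 3, 1 / 3]     # hand-derived maxent equilibrium

uniform_avg = mat_vec(A, uniform)
nash_avg = mat_vec(A, p_star)

print("uniform average:", uniform_avg)
print("Nash average:   ", nash_avg)
print("uniform is Nash:", is_nash(A, uniform))
print("p_star is Nash: ", is_nash(A, p_star))
```

Under uniform averaging the duplicated rock makes paper look strongest (0.25) and scissors weakest (-0.25). Under `p_star` every agent scores 0, as in the game without the duplicate.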

Paper Structure

This paper contains 27 sections, 14 theorems, 44 equations, 6 figures.

Key Result

Proposition 1

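Elo ratings are at a stationary point under batch updates iff the matrices of empirical probabilities and predicted probabilities have the same row-sums (or, equivalently, the same column-sums): $\sum_j \bar{p}_{ij} = \sum_j \hat{p}_{ij}$ for every agent $i$, where $\bar{p}_{ij}$ and $\hat{p}_{ij}$ denote the empirical and predicted probabilities that agent $i$ beats agent $j$.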
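
A small numerical check of this condition, as a sketch under stated assumptions rather than the paper's code: predicted probabilities follow the standard Elo form `sigmoid(r_i - r_j)`, the batch update adds a step size times the row-wise gap between empirical and predicted probabilities, and the empirical matrix `P_EMP` is a made-up 3-agent example. Function names, the step size, and the iteration count are illustrative choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predicted(ratings):
    """Predicted win probabilities p_hat[i][j] = sigmoid(r_i - r_j)."""
    n = len(ratings)
    return [[sigmoid(ratings[i] - ratings[j]) for j in range(n)]
            for i in range(n)]


def row_sums(P):
    """Off-diagonal row-sums of a square matrix."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) for i in range(n)]


def batch_elo(p_emp, lr=0.1, steps=5000):
    """Repeat the batch update r_i += lr * sum_j (p_emp[i][j] - p_hat[i][j])."""
    ratings = [0.0] * len(p_emp)
    target = row_sums(p_emp)
    for _ in range(steps):
        current = row_sums(predicted(ratings))
        ratings = [r + lr * (t - c)
                   for r, t, c in zip(ratings, target, current)]
    return ratings


# Hypothetical empirical win probabilities, with P_EMP[i][j] + P_EMP[j][i] = 1.
P_EMP = [
    [0.5, 0.7, 0.4],
    [0.3, 0.5, 0.8],
    [0.6, 0.2, 0.5],
]

ratings = batch_elo(P_EMP)
emp_rows = row_sums(P_EMP)
pred_rows = row_sums(predicted(ratings))

print("ratings:            ", ratings)
print("empirical row-sums: ", emp_rows)
print("predicted row-sums: ", pred_rows)
```

In this example agents 0 and 1 end with equal ratings because their empirical row-sums agree, even though agent 0 beats agent 1 in 70% of games. Matching row-sums does not imply matching individual entries.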

Figures (6)

  • Figure 1: (A) The Nash ${\mathbf p}_a^*$ assigned to agents; (B) the Nash ${\mathbf p}_e^*$ assigned to environments.
  • Figure 2: Comparison of uniform and Nash averages. (A) Skill of agents by uniform $\frac{1}{n}{\mathbf S}\cdot{\mathbf 1}$ and Nash ${\mathbf S}\cdot {\mathbf p}_e^*$ averaging over environments. (B) Difficulty of environments under uniform $-\frac{1}{m}{\mathbf S}^\intercal\cdot{\mathbf 1}$ and Nash $-{\mathbf S}^\intercal\cdot{\mathbf p}_a^*$ averaging over agents. Agents and environments are sorted by Nash-averages.
  • Figure 3: Visualizing Schur decompositions. (A) Rows of ${\mathbf Q}^{\mathbf T}_{4\times 2}$ form a straight line, reflecting the transitive structure of ${\mathbf T}$. (B) Rows of ${\mathbf Q}^{\mathbf C}_{4\times 2}$ lie on a circle centered at the origin.
  • Figure 4: Visualizing logits. The entry ${\mathbf A}_{ij}$ of ${\mathbf A}$ is $\lambda$ times the signed area of the parallelogram covered by the origin $(0,0)$, ${\mathbf Q}_{i, \bullet}$, ${\mathbf Q}_{i, \bullet}+{\mathbf Q}_{j,\bullet}$ and ${\mathbf Q}_{j,\bullet}$, where ${\mathbf Q}_{i, \bullet}$ and ${\mathbf Q}_{j, \bullet}$ are vectors corresponding to row $i$ and row $j$ of ${\mathbf Q}_{4\times 2}$.
  • Figure 5: Evaluation of environments.
  • ...and 1 more figure
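
The geometric picture in the captions of Figures 3 and 4 can be tried out numerically. The following Python sketch assumes the orientation convention `A[i][j] = lam * (Q[i][0] * Q[j][1] - Q[i][1] * Q[j][0])` for the signed area. It uses two hand-picked embeddings that are not taken from the paper and are not normalized the way a Schur factor would be: four points on the unit circle, and four points on the horizontal line `y = 1` whose x-coordinates play the role of ratings.

```python
def logits_from_embedding(Q, lam=1.0):
    """A[i][j] = lam * signed area of the parallelogram spanned by rows i and j."""
    n = len(Q)
    return [[lam * (Q[i][0] * Q[j][1] - Q[i][1] * Q[j][0]) for j in range(n)]
            for i in range(n)]


# Hypothetical cyclic embedding: rows at angles 0, 90, 180, 270 degrees.
Q_cyclic = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

# Hypothetical transitive embedding: rows (r_i, 1) on a straight line.
example_ratings = [3.0, 2.0, 1.0, 0.0]
Q_transitive = [(r, 1.0) for r in example_ratings]

A_cyclic = logits_from_embedding(Q_cyclic)
A_transitive = logits_from_embedding(Q_transitive)

for name, A in (("cyclic", A_cyclic), ("transitive", A_transitive)):
    print(name)
    for row in A:
        print("  ", row)
```

With the points on the circle, agent 0 beats 1, 1 beats 2, 2 beats 3, and 3 beats 0, and every row sums to zero. With the points on the line, each entry reduces to the rating difference `r_i - r_j`.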

Theorems & Definitions (30)

  • Proposition 1
  • Proposition 2
  • Theorem: Hodge decomposition (Jiang et al., 2011)
  • Proposition 3
  • Definition 1
  • Proposition 4: maxent NE
  • Definition 2
  • Example 1: invariance
  • Theorem 1: main result for AvA
  • Example 2: continuity
  • ...and 20 more
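
As an illustration of the transitive/cyclic split behind the Hodge decomposition entry, the Python sketch below assumes the usual construction for a complete comparison graph: the divergence of an antisymmetric matrix `A` is its vector of row means, the transitive part is the matrix of rating differences built from that vector, and the cyclic part is the remainder. The example matrix and the function names are hypothetical; the paper should be consulted for the exact operators and normalization.

```python
def divergence(A):
    """Row means of an antisymmetric matrix, read as per-agent ratings."""
    n = len(A)
    return [sum(row) / n for row in A]


def gradient(r):
    """Transitive matrix of rating differences, T[i][j] = r_i - r_j."""
    return [[ri - rj for rj in r] for ri in r]


def hodge_split(A):
    """Split A into a transitive part and a cyclic remainder."""
    T = gradient(divergence(A))
    C = [[a - t for a, t in zip(row_a, row_t)] for row_a, row_t in zip(A, T)]
    return T, C


# Hypothetical logits: rating differences for r = (1, 0, -1) plus a
# rock-paper-scissors cycle of strength 0.5.
A = [
    [0.0, 1.5, 1.5],
    [-1.5, 0.0, 1.5],
    [-1.5, -1.5, 0.0],
]

T, C = hodge_split(A)
print("ratings (divergence):", divergence(A))
print("transitive part:", T)
print("cyclic part:    ", C)
```

Here the divergence recovers the ratings (1, 0, -1), and the cyclic remainder is the rock-paper-scissors pattern, whose rows sum to zero so it contributes nothing to any rating.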