Re-evaluating Evaluation
David Balduzzi, Karl Tuyls, Julien Perolat, Thore Graepel
TL;DR
This paper addresses how to evaluate machine learning agents across an increasingly diverse set of tasks, and shows that naively averaging over whatever benchmarks happen to be included biases the results. It introduces Nash averaging, a method based on the maximum-entropy (maxent) Nash equilibrium that yields invariant, interpretable evaluations by automatically adapting to redundancies among agents and tasks, in both the agent-vs-agent (AvA) and agent-vs-task (AvT) settings. The framework uses antisymmetric logit matrices, Schur decompositions, and combinatorial Hodge theory to extend Elo to a multidimensional form (mElo) and to decompose interactions into transitive and cyclic components. Empirical results on Atari show that Nash averaging identifies a core set of agents and environments, and suggest that human performance is on par with the strongest agents under a balanced evaluation. Overall, the approach promotes inclusive, robust evaluation while acknowledging limitations tied to data quality and to practical choices in benchmarking.
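The split into transitive and cyclic components can be illustrated with a minimal sketch. It assumes the simplest Hodge-style decomposition of an antisymmetric matrix: each agent's rating is its row mean, the transitive part is the matrix of rating differences, and the cyclic part is whatever remains. The matrix values and the names `decompose`, `ratings`, `transitive`, and `cyclic` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def decompose(A):
    """Split an antisymmetric matrix A into transitive and cyclic parts.

    ratings[i]       = mean of row i
    transitive[i][j] = ratings[i] - ratings[j]
    cyclic           = A - transitive
    """
    n = len(A)
    ratings = [sum(row) / n for row in A]
    transitive = [[ratings[i] - ratings[j] for j in range(n)] for i in range(n)]
    cyclic = [[A[i][j] - transitive[i][j] for j in range(n)] for i in range(n)]
    return ratings, transitive, cyclic


# Hypothetical 3-agent logit matrix (A[i][j] = -A[j][i]): agent 0 has a
# transitive edge over the others, layered on a rock-paper-scissors cycle.
A = [[Fraction(x) for x in row] for row in [
    [0, 2, 0],
    [-2, 0, 1],
    [0, -1, 0],
]]

ratings, transitive, cyclic = decompose(A)

print("ratings:   ", [str(r) for r in ratings])
print("transitive:", [[str(x) for x in row] for row in transitive])
print("cyclic:    ", [[str(x) for x in row] for row in cyclic])
```

In this toy case the cyclic part is a pure three-way cycle whose rows sum to zero, so it contributes nothing to any agent's rating; a single scalar rating per agent cannot represent it, which is the gap a multidimensional extension of Elo is meant to fill.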
Abstract
Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since there is no harm (computational cost aside) from including all available tasks and agents.
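The claim about redundancy can be checked on a toy example. The sketch below assumes a hypothetical agent-vs-agent game: rock-paper-scissors with a duplicated rock agent. It does not compute the equilibrium with a solver; the maximum-entropy Nash distribution is derived by hand for this specific matrix and then verified. The payoff values and the helper names `averaged_skill` and `is_symmetric_nash` are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction as F

# Hypothetical antisymmetric payoff matrix: A[i][j] > 0 means agent i tends
# to beat agent j. "rock_copy" is an exact duplicate of "rock".
agents = ["rock", "paper", "scissors", "rock_copy"]
A = [
    [0, -1, 1, 0],
    [1, 0, -1, 1],
    [-1, 1, 0, -1],
    [0, -1, 1, 0],
]


def averaged_skill(A, weights):
    """Score each agent by its payoff against a weighted mix of opponents."""
    return [sum(F(a) * w for a, w in zip(row, weights)) for row in A]


def is_symmetric_nash(A, p):
    """For an antisymmetric zero-sum game the value is 0, so p is a Nash
    strategy iff no pure strategy scores above 0 against it."""
    return all(s <= 0 for s in averaged_skill(A, p))


# Uniform averaging counts rock twice.
uniform = [F(1, 4)] * 4

# Hand-derived for this matrix: every Nash strategy puts 1/3 on paper, 1/3 on
# scissors, and 1/3 in total on the two rocks; the maximum-entropy choice
# splits the rock mass evenly.
maxent_nash = [F(1, 6), F(1, 3), F(1, 3), F(1, 6)]

uniform_skill = averaged_skill(A, uniform)
nash_skill = averaged_skill(A, maxent_nash)

for name, u, s in zip(agents, uniform_skill, nash_skill):
    print(f"{name:10s} uniform={u}  nash={s}")
```

Under uniform averaging, paper appears stronger and scissors weaker purely because rock was included twice. Under the Nash weighting, the two rocks share the mass a single rock would receive, so all agents score zero, exactly as in plain rock-paper-scissors.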
