Deviation Ratings: A General, Clone-Invariant Rating Method
Luke Marris, Siqi Liu, Ian Gemp, Georgios Piliouras, Marc Lanctot
TL;DR
This work addresses the problem of rating strategies in N-player general-sum normal-form games by introducing deviation ratings, a clone-invariant, mixture-invariant, and offset-invariant method based on coarse correlated equilibria. By selecting unique deviation gains through a sequence of linear programs, the approach avoids equilibrium-selection pitfalls while guaranteeing that the ratings exist and are unique. The method is demonstrated on illustrative cyclic and coordination games and on language-model evaluation datasets, showing that deviation ratings neutralize redundancy and reveal niche strengths while scaling to large, real-world evaluation data. The practical result is a robust, data-agnostic framework for evaluating multi-agent systems (notably LLMs) that can guide targeted model improvement without requiring data curation and without being vulnerable to clone attacks.
Abstract
Many real-world multi-agent or multi-task evaluation scenarios can be naturally modelled as normal-form games because of their inherent strategic (adversarial, cooperative, and mixed-motive) interactions. These strategic interactions may be agentic (e.g. players trying to win), fundamental (e.g. cost vs quality), or complementary (e.g. niche finding and specialization). In such a formulation, it is the strategies (actions, policies, agents, models, tasks, prompts, etc.) that are rated. However, the rating problem is complicated by the redundancy and complexity of N-player strategic interactions. Repeated or similar strategies can distort the ratings of those that counter or complement them. Previous work proposed "clone-invariant" ratings to handle such redundancies, but this was limited to two-player zero-sum (i.e. strictly competitive) interactions. This work introduces the first N-player general-sum clone-invariant rating, called deviation ratings, based on coarse correlated equilibria. The rating is explored in several domains, including LLM evaluation.
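
The distortion caused by repeated strategies can be seen in a small example. The following is a minimal sketch, assuming a naive baseline that rates each strategy by its mean payoff against a uniformly random opponent. This baseline and the helper names `mean_payoff_rating` and `add_clone` are illustrative assumptions, not the deviation rating method itself. In rock-paper-scissors, adding a second copy of rock raises the rating of paper (which counters rock) and lowers the rating of scissors, even though no new strategic content was added.

```python
import numpy as np

# Row player's payoffs in rock-paper-scissors (+1 win, -1 loss, 0 draw).
rps = np.array([
    [ 0, -1,  1],   # rock
    [ 1,  0, -1],   # paper
    [-1,  1,  0],   # scissors
], dtype=float)


def mean_payoff_rating(payoffs):
    """Illustrative baseline: mean payoff of each row against a uniform opponent."""
    return payoffs.mean(axis=1)


def add_clone(payoffs, index):
    """Duplicate strategy `index` for both players of a symmetric game."""
    with_row = np.vstack([payoffs, payoffs[index:index + 1, :]])
    return np.hstack([with_row, with_row[:, index:index + 1]])


base = mean_payoff_rating(rps)                  # rock, paper, scissors
cloned = mean_payoff_rating(add_clone(rps, 0))  # rock, paper, scissors, rock clone

print("without clone:", base)
print("with rock clone:", cloned)
```

A clone-invariant rating would, by definition, leave the ratings of rock, paper, and scissors unchanged after the duplicate is added, and would give the duplicate the same rating as the original.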
