Skill Rating for Generative Models
Catherine Olsson, Surya Bhupatiraju, Tom Brown, Augustus Odena, Ian Goodfellow
TL;DR
This paper introduces a tournament-based framework for evaluating generative models: generators are pitted against discriminators, and each player's latent skill is tracked with Elo- or Glicko2-style ratings. It defines two metrics, tournament win rate and skill rating, and demonstrates their utility both for monitoring training progress and for comparing trained GANs across seeds and architectures. The authors show that the method generalizes to non-image domains and applies even when generators are near-perfect, and they discuss its strengths and limitations relative to likelihood-based and perceptual metrics. The work suggests a flexible, scalable, and reproducible alternative for GAN evaluation and outlines directions for refining tournaments and discriminators.
Abstract
We explore a new way to evaluate generative models using insights from evaluation of competitive games between human players. We show experimentally that tournaments between generators and discriminators provide an effective way to evaluate generative models. We introduce two methods for summarizing tournament outcomes: tournament win rate and skill rating. Evaluations are useful in different contexts, including monitoring the progress of a single model as it learns during the training process, and comparing the capabilities of two different fully trained models. We show that a tournament consisting of a single model playing against past and future versions of itself produces a useful measure of training progress. A tournament containing multiple separate models (using different seeds, hyperparameters, and architectures) provides a useful relative comparison between different trained GANs. Tournament-based rating methods are conceptually distinct from numerous previous categories of approaches to evaluation of generative models, and have complementary advantages and disadvantages.
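To make the two summaries concrete, the following is a minimal sketch, not the authors' implementation. It assumes a match is a single binary outcome (1 if the generator's sample fooled the discriminator, 0 otherwise) and substitutes a plain Elo update for the Glicko2-style rating named in the TL;DR. The function names, the K-factor of 16, and the initial rating of 1500 are illustrative assumptions.

```python
def tournament_win_rates(outcomes):
    """Average win rate of each generator over all discriminators it faced.

    outcomes[g][d] is a list of match results for generator g against
    discriminator d: 1 if g fooled d, 0 otherwise (assumed encoding).
    """
    rates = {}
    for gen, per_disc in outcomes.items():
        results = [r for matches in per_disc.values() for r in matches]
        rates[gen] = sum(results) / len(results)
    return rates


def elo_expected(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(matches, k=16.0, initial=1500.0):
    """Sequential Elo updates over (generator, discriminator, gen_won) tuples.

    Plain Elo is used here as a simpler stand-in for Glicko2; k and initial
    are illustrative defaults, not values taken from the paper.
    """
    ratings = {}
    for gen, disc, gen_won in matches:
        r_g = ratings.setdefault(gen, initial)
        r_d = ratings.setdefault(disc, initial)
        e_g = elo_expected(r_g, r_d)
        # The generator gains what the discriminator loses (zero-sum update).
        ratings[gen] = r_g + k * (gen_won - e_g)
        ratings[disc] = r_d - k * (gen_won - e_g)
    return ratings


# Hypothetical tournament: an early and a late generator checkpoint
# each play four matches against one discriminator.
outcomes = {
    "G_early": {"D1": [0, 0, 1, 0]},
    "G_late": {"D1": [1, 1, 1, 0]},
}
win_rates = tournament_win_rates(outcomes)

matches = [
    (gen, disc, result)
    for gen, per_disc in outcomes.items()
    for disc, results in per_disc.items()
    for result in results
]
ratings = elo_ratings(matches)
```

Win rate depends on which discriminators happen to be in the tournament, whereas a rating system weights each result by the opponent's estimated strength, which is one reason to report both.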
