Skill Rating for Generative Models

Catherine Olsson, Surya Bhupatiraju, Tom Brown, Augustus Odena, Ian Goodfellow

TL;DR

This paper introduces a tournament-based framework for evaluating generative models by pitting generators against discriminators and tracking latent skill via Elo/Glicko2-style ratings. It defines two metrics, tournament win rate and skill rating, and demonstrates their utility for both monitoring training progress and comparing trained GANs across seeds and architectures. The authors show the method generalizes to non-image domains and even to near-perfect generator scenarios, highlighting its strengths and limitations relative to likelihood-based and perceptual metrics. The work suggests a flexible, scalable, and reproducible alternative for GAN evaluation and outlines directions for refining tournaments and discriminators.

Abstract

We explore a new way to evaluate generative models using insights from evaluation of competitive games between human players. We show experimentally that tournaments between generators and discriminators provide an effective way to evaluate generative models. We introduce two methods for summarizing tournament outcomes: tournament win rate and skill rating. Evaluations are useful in different contexts, including monitoring the progress of a single model as it learns during the training process, and comparing the capabilities of two different fully trained models. We show that a tournament consisting of a single model playing against past and future versions of itself produces a useful measure of training progress. A tournament containing multiple separate models (using different seeds, hyperparameters, and architectures) provides a useful relative comparison between different trained GANs. Tournament-based rating methods are conceptually distinct from numerous previous categories of approaches to evaluation of generative models, and have complementary advantages and disadvantages.
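As a rough illustration of the two summaries named above, the sketch below computes a tournament win rate and a simple skill rating from a matrix of match outcomes. It rests on several assumptions that do not come from the paper: the matrix layout (rows are discriminators, columns are generators, entries are the generator's average win rate), the function names, and the use of a plain Elo update in place of the Glicko2 system the paper reports. The constants `k=32`, `initial=1500`, and `passes=10` are illustrative only.

```python
import numpy as np

def tournament_win_rate(win_matrix):
    """Average generator win rate per generator.

    Assumed layout: rows index discriminators, columns index generators.
    """
    return np.asarray(win_matrix, dtype=float).mean(axis=0)

def elo_expected(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_gen, r_disc, gen_score, k=32.0):
    """One zero-sum Elo update; gen_score is the generator's score in [0, 1]."""
    delta = k * (gen_score - elo_expected(r_gen, r_disc))
    return r_gen + delta, r_disc - delta

def rate_tournament(win_matrix, initial=1500.0, k=32.0, passes=10):
    """Sweep over every generator/discriminator pairing a few times.

    Plain Elo is used here as a simpler stand-in for Glicko2.
    """
    w = np.asarray(win_matrix, dtype=float)
    n_disc, n_gen = w.shape
    gen = np.full(n_gen, initial)
    disc = np.full(n_disc, initial)
    for _ in range(passes):
        for d in range(n_disc):
            for g in range(n_gen):
                gen[g], disc[d] = elo_update(gen[g], disc[d], w[d, g], k)
    return gen, disc

# Hypothetical 2x2 tournament: generator 1 wins more often than generator 0.
outcomes = [[0.2, 0.8], [0.1, 0.9]]
print(tournament_win_rate(outcomes))
print(rate_tournament(outcomes))
```

Unlike the win rate, a rating update depends on the opponent's current rating, so a rating can still be estimated when some pairings are never played.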

Paper Structure

This paper contains 20 sections, 12 figures.

Figures (12)

  • Figure 1: Within-trajectory tournament outcomes for experiment 1. In the upper half of the figure: Figure 1a-left shows raw tournament outcomes. Each pixel represents the average win rate between one generator and one discriminator from different iterations of experiment 1. Brighter pixel values represent stronger generator performance. Figure 1a-right compares tournament summary measures to SVHN classifier score. Tournament win rate in this figure is the column-wise average of the pixel values in the heatmap. (Note that the classifier score at $i$=0 is lower than 4.0, which obscures the alignment between the rest of the curves when plotted on the same axis, so we omit it.) In the lower half of the figure: Figure 1b shows the same data but with matchups from far-apart iterations omitted, shown as grey pixels in Figure 1b-left. Figure 1b-right shows that skill rating continues to track the improvement of the model, even though some of the most informative battles (between early generators and later discriminators, in the top left) have been omitted, whereas the tournament win rate is no longer informative.
  • Figure 2: Within-trajectory skill rating applied to drawings of apples. We evaluate a DCGAN trained on drawings of apples from the QuickDraw dataset. From left to right, subjective sample quality improves with more iterations. SVHN classifier score is a poor judge of quality for these samples, rating iteration 0 the highest, and providing choppy but broadly worsening ratings thereafter. SVHN Fréchet distance is a better fit, rating sample quality as steadily increasing until iteration 1300; however, it saturates at this point, whereas subjective sample quality continues increasing. (Note the inverted y-axis on the Fréchet distance plot, such that lower distance, i.e. better quality, is plotted higher on the plot.) Within-trajectory skill rating continues improving beyond iteration 1300.
  • Figure 3: Multiple-trajectory tournament outcomes. We run a tournament containing SVHN generator and discriminator snapshots from models with different seeds, hyperparameters, and architectures (described in Section \ref{compare}). We evaluate them using SVHN classifier score (left), SVHN Fréchet distance (center), and our skill rating method (right; see Section \ref{skillrating}). Each point represents the score of one iteration of one model. The overall trajectories show the improvement of each model with increasing training. Note the inverted y-axis on the Fréchet distance plot, such that lower distance (better quality) is plotted higher on the plot. The score of real data samples is shown as a black line. The score of 6-auto is evaluated from a single snapshot, rather than a full training curve, and is shown as a grey line. The learning curves produced by skill rating broadly agree with those produced by Fréchet distance, and disagree with classifier score only in the case of the conditional models 4-cond and 5-cond; we speculate about this discrepancy in Section \ref{compare}.
  • Figure 4: Samples from fully-trained generative models. From each trained model, we show 64 samples (from iteration 200,000 of the GANs and epoch 106 of 6-auto), along with real data for comparison. Under each set of samples, we list the Glicko2 skill rating (SR), SVHN classifier score (CS), and SVHN Fréchet distance (FD) of the model. Our skill rating system ranks experiment 5-cond as being slightly worse than real data and slightly better than runners-up 4-cond and 1, whereas classifier score ranks 5-cond better than real data, and Fréchet distance ranks 5-cond worse than both 4-cond and 1. Our system's rankings agree with Fréchet distance in all other cases.
  • Figure 5: Evaluating a near-perfect generator on a toy problem. We train an ordinary GAN to model a Gaussian distribution with a full covariance matrix. Generators from iteration 8000 onwards have mastered this task. Discriminators from iteration 8000 onwards no longer produce useful judgments (Figure \ref{fig-chekhov-self-heatmap}). Chekhov GAN discriminators beyond iteration 8000 retain their ability to judge past generators' samples (Figure \ref{fig-chekhov-chekhov-heatmap}). Figure \ref{fig-chekhov-skillrating} compares skill ratings from these discriminators with the ground truth performance of the ordinary generator, measured as the mean absolute difference between the generator's estimated covariance matrix and that of the data. Skill ratings against the Chekhov discriminator were a better fit to the ground truth than those from within-trajectory matches.
  • ...and 7 more figures
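
The ground-truth measure named in the Figure 5 caption (mean absolute difference between the generator's estimated covariance matrix and that of the data) could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `covariance_error` is hypothetical, the covariance is estimated from generator samples with NumPy's default unbiased estimator, and the difference is averaged element-wise over all matrix entries. The paper's exact estimator may differ.

```python
import numpy as np

def covariance_error(generated_samples, data_covariance):
    """Mean absolute element-wise gap between two covariance matrices.

    generated_samples: array of shape (n_samples, n_dims) drawn from a generator.
    data_covariance:   array of shape (n_dims, n_dims), the target covariance.
    """
    samples = np.asarray(generated_samples, dtype=float)
    # rowvar=False treats each column as one variable (dimension).
    estimated = np.cov(samples, rowvar=False)
    target = np.asarray(data_covariance, dtype=float)
    return float(np.mean(np.abs(estimated - target)))

# Hypothetical usage: samples drawn from the target itself should score near zero.
rng = np.random.default_rng(0)
target_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=target_cov, size=10000)
print(covariance_error(samples, target_cov))
```

Lower values indicate a generator whose samples match the target covariance more closely.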