Pros and Cons of GAN Evaluation Measures
Ali Borji
TL;DR
The paper surveys two broad classes of GAN evaluation methods, quantitative and qualitative, covering more than 24 quantitative and 5 qualitative measures, and introduces a set of 7 desiderata for judging them. It critically analyzes well-known metrics (e.g., IS, FID, MMD, C2ST, NRDS, GAM) and a suite of qualitative tests, outlining their strengths, limitations, and domain dependencies. The author argues that no single metric sufficiently captures fidelity, diversity, and latent-space properties, and emphasizes the need for standardized benchmarks, transparency, and application-specific evaluation. The paper offers practical guidance on selecting measures based on whether fidelity or diversity is the priority, and calls for open-source tooling and coordinated benchmarks to accelerate progress. It also highlights the trade-offs between perceptual alignment, statistical distance, and computational efficiency, and suggests future directions for harmonizing GAN evaluation efforts.
Abstract
Generative models, in particular generative adversarial networks (GANs), have received significant attention recently. A number of GAN variants have been proposed and utilized in many applications. Despite large strides in theoretical progress, evaluating and comparing GANs remains a daunting task. While several measures have been introduced, there is as yet no consensus as to which measure best captures the strengths and limitations of models and should be used for fair model comparison. As in other areas of computer vision and machine learning, it is critical to settle on one or a few good measures to steer progress in this field. In this paper, I review and critically discuss more than 24 quantitative and 5 qualitative measures for evaluating generative models, with a particular emphasis on GAN-derived models. I also provide a set of 7 desiderata, followed by an evaluation of whether a given measure or family of measures is compatible with them.
