Unifying and extending Precision Recall metrics for assessing generative models
Benjamin Sykes, Loic Simon, Julien Rabin
TL;DR
This work addresses the evaluation of generative models by comparing the real and generated distributions $P$ and $Q$ through a unified precision-recall frontier (PRD). It reinterprets various extreme-PR metrics within a binary-classification framework, extends them into full PR curves using kNN-based estimators, and provides a consistency analysis (with data splitting) alongside practical improvements (split, k, bandwidth, and KDE variants). Through experiments on Gaussian shifts and Gaussian mixtures, the authors show that full PR curves reveal mode dropping, mode invention, and mode re-weighting, and they advocate Coverage-based variants as more robust than extreme-PR methods. The results offer actionable guidance for evaluating generative models, especially in high dimensions, and point to further work on convergence analysis and on scalar summaries (F-scores and PR median) that ease practical comparisons.
Abstract
With the recent success of generative models for images and text, the evaluation of such models has gained a lot of attention. Whereas most generative models are compared in terms of scalar values such as the Fréchet Inception Distance (FID) or the Inception Score (IS), Sajjadi et al. (2018) proposed a definition of a precision-recall curve to characterize the closeness of two distributions. Since then, various approaches to precision and recall have emerged (Kynkaanniemi et al., 2019; Naeem et al., 2020; Park & Kim, 2023). They focus on the extreme values of precision and recall, but beyond this common point, the connections between them remain elusive. In this paper, we unify most of these approaches under the same umbrella, relying on the work of Simon et al. (2019). In doing so, we are able not only to recover entire curves, but also to expose the sources of the reported pitfalls of the metrics concerned. We also provide consistency results that go well beyond those presented in the corresponding literature. Finally, we study the different behaviors of the curves obtained experimentally.
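To make the notion of extreme precision and recall concrete, the following is a minimal sketch of a kNN-based estimator in the spirit of Kynkaanniemi et al. (2019); it is not the estimator proposed in this paper. Precision is taken as the fraction of generated samples that fall inside at least one kNN ball centered on a real sample, and recall is the same quantity with the roles swapped. The function names, the choice k=3, and the 2D Gaussian toy data are assumptions made for illustration only.

```python
import numpy as np

def knn_radii(points, k):
    # Distance from each point to its k-th nearest neighbor in the same set.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the k-th neighbor distance.
    return np.sort(d, axis=1)[:, k]

def support_coverage(queries, refs, k):
    # Fraction of query points lying inside at least one kNN ball
    # centered on a reference point.
    radii = knn_radii(refs, k)
    d = np.linalg.norm(queries[:, None, :] - refs[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def knn_precision_recall(real, fake, k=3):
    # Precision: generated samples covered by the estimated real support.
    # Recall: real samples covered by the estimated generated support.
    precision = support_coverage(fake, real, k)
    recall = support_coverage(real, fake, k)
    return precision, recall

# Toy data (illustrative assumption): 2D standard Gaussians.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 2))
same = rng.normal(0.0, 1.0, size=(500, 2))             # same distribution
shifted = rng.normal(0.0, 1.0, size=(500, 2)) + 10.0   # far-shifted copy

p_same, r_same = knn_precision_recall(real, same)
p_shift, r_shift = knn_precision_recall(real, shifted)
```

On this toy data, the pair for `same` should be close to 1 and the pair for `shifted` should be 0, since the two point clouds do not overlap. Each pair is only a single point, which is exactly the limitation that motivates recovering entire curves.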
