Unifying and extending Precision Recall metrics for assessing generative models

Benjamin Sykes; Loic Simon; Julien Rabin

TL;DR

This work addresses evaluating generative models by comparing real and generated distributions $P$ and $Q$ through a unified precision-recall frontier (PRD). It reinterprets various extreme-PR metrics within a binary-classification framework, extends them into full PR curves using kNN-based estimators, and provides a consistency analysis (with data splitting) alongside practical improvements (split, k, bandwidth, and KDE variants). Through experiments on Gaussian shifts and Gaussian mixtures, the authors show that full PR curves reveal mode dropping, invention, and re-weighting, and advocate Coverage-based variants as more robust than extreme-PR methods. The results offer actionable guidance for evaluating generative models, especially in high dimensions, and highlight avenues for convergence analysis and scalar summaries (F-scores and PR median) to ease practical comparisons.

Abstract

With the recent success of generative models for images and text, their evaluation has gained a lot of attention. Whereas most generative models are compared in terms of scalar values such as the Fréchet Inception Distance (FID) or the Inception Score (IS), Sajjadi et al. (2018) proposed a definition of a precision-recall curve to characterize the closeness of two distributions. Since then, various approaches to precision and recall have emerged (Kynkäänniemi et al., 2019; Naeem et al., 2020; Park & Kim, 2023). They focus on the extreme values of precision and recall, but beyond this, the ties between them are elusive. In this paper, we unify most of these approaches under the same umbrella, relying on the work of Simon et al. (2019). Doing so, we are able not only to recover entire curves, but also to expose the sources of the known pitfalls of these metrics. We also provide consistency results that go well beyond those presented in the corresponding literature. Finally, we study the different behaviors of the curves obtained experimentally.
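To make the notion of "extreme values of precision and recall" concrete, here is a minimal sketch of a kNN-ball estimator in the spirit of the improved precision/recall of Kynkäänniemi et al. (2019), as it is commonly described: precision is the fraction of generated samples that fall inside at least one k-NN ball centred on a real sample, and recall swaps the two roles. The function names, the brute-force distance computation, and the toy data are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def kth_nn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # After sorting, column 0 is the point itself (distance 0), so column k is the k-th neighbour.
    return np.sort(d, axis=1)[:, k]

def covered_fraction(queries, refs, k):
    """Fraction of queries lying inside at least one k-NN ball centred on a reference point."""
    radii = kth_nn_radius(refs, k)
    d = np.linalg.norm(queries[:, None, :] - refs[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def extreme_precision_recall(real, fake, k=4):
    """Return (precision, recall) estimates from k-NN balls (illustrative helper)."""
    precision = covered_fraction(fake, real, k)  # generated samples covered by real balls
    recall = covered_fraction(real, fake, k)     # real samples covered by generated balls
    return precision, recall

# Toy usage: two independent samples, the second one shifted (hypothetical data).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
fake = rng.normal(loc=0.5, size=(500, 8))
print(extreme_precision_recall(real, fake, k=4))
```

The brute-force pairwise distances cost O(n²) memory, which is acceptable for a few thousand points; a nearest-neighbour index would be the usual choice for larger samples.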


Paper Structure (34 sections, 2 theorems, 30 equations, 8 figures, 1 table)


Introduction
Recap on the relevant literature
The gist on the original PR curve notion
Re-assessing extreme precision-recall values
Improved PR metric and follow-up works
IPR
Coverage
EAS
PRC
PPR
Re-interpretation and improvements
Classification interpretation
IPR
Coverage
Note on symmetry
...and 19 more sections

Key Result

Proposition 2.2

Let $P, Q$ be two distributions. Then all co-supports of $P$ and $Q$ have the same $Q$-mass and

Figures (8)

Figure 1: Right: the PR-curve is the frontier of the shaded area composed of all admissible PR pairs $(\beta,\alpha)$. In essence, these pairs represent the mass of $P$ and $Q$ that one can recover by selecting a subset of the common support (gray area on the left). More precisely, by selecting regions of high likelihood of $P$, one trades precision ($\alpha$) in favor of recall ($\beta$). The extreme values $\beta_0(P,Q)$ and $\alpha_\infty(P,Q)$ embody the respective masses of the entire common support.
Figure 2: Comparing two shifted Gaussians. The ground-truth PR curve (GT, dashed) is compared to empirical estimates from various NN-classifiers: iPR, kNN, Parzen, and Coverage. Here $P \sim \mathcal{N}(0,\mathbb{I}_{d})$ and $Q \sim \mathcal{N}(\mu \mathbf{1}_{d},\mathbb{I}_{d})$ with $d=64$ dimensions and $\mu=\frac{1}{\sqrt d}\approx 0.12$ or $\mu=\frac{3}{\sqrt d}\approx 0.38$. $n=10$K points are sampled, and NN comparisons use $k=4$ or $k=\sqrt n$, with or without a validation/train split of the dataset. (Curves are averaged over 10 random samples; see Appendix.)
Figure 3: PR curves in high dimension. Same experiment as in Fig. 2 (50% split, $n=10$K, $k=\sqrt{n}$) with $d=2048$ dimensions and $\mu \in \{\frac{1}{\sqrt{d}},\frac{2}{\sqrt{d}}\}$.
Figure 4: Illustration of the impact of splitting for $P=Q$. The setting is the same as in Fig. 2, with a translation of $\mu=0$ between two Gaussians in dimension $d=64$ (curves are averaged over 3 random samples). The ground-truth PR curve (GT, dashed) is compared to empirical estimates from various NN-classifiers: iPR, kNN, Parzen, and Coverage. Top: results without splitting; as reported in the literature, the estimated extreme precision and recall values are not equal to 1, contrary to the ground truth. Bottom: curves obtained with a 50% split are very close to the ideal curve.
Figure 5: Comparing two Gaussian mixtures. This figure complements Fig. \ref{fig:GMM-dim64}. The ground-truth PR curve (GT, dashed) is compared to empirical estimates from various NN-classifiers: iPR, kNN, Parzen, and Coverage. Here $P$ and $Q$ are two GMMs sharing the same modes (centered at $\mu_\ell$): $P \sim \sum_{\ell} p_\ell \mathcal{N}(\mu_\ell \mathbf 1_{d}, \mathbb{I}_{d})$ and $Q \sim \sum_{\ell} q_\ell \mathcal{N}(\mu_\ell \mathbf 1_{d}, \mathbb{I}_{d})$ with $d=64$ dimensions and $\mu_\ell \in \{0, -5, 3, 5\}$. However, $P$ and $Q$ have different weights: $p_\ell \in \{0.3, 0.2, 0.5, 0\}$ and $q_\ell \in \{0, 0.5, 0.2, 0.3\}$. $n=1$K points are sampled and split in half between validation and train, with $k=\sqrt n$.
...and 3 more figures
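For the shifted-Gaussian setting of Figure 2, a ground-truth curve can be computed in closed form. The sketch below assumes the parameterization of Sajjadi et al. (2018), $\alpha(\lambda)=\int \min(\lambda p, q)$ and $\beta(\lambda)=\alpha(\lambda)/\lambda$; whether the paper plots exactly this parameterization is an assumption. Since both Gaussians share the identity covariance, the likelihood ratio depends only on the projection onto the shift direction, so the problem reduces to one dimension with $\delta=\mu\sqrt d$. The function names are hypothetical.

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_shift_pr(lam, delta):
    """(alpha, beta) at slope lam for P = N(0, I) and Q = N(shift, I), with ||shift|| = delta."""
    if delta == 0.0:
        alpha = min(lam, 1.0)  # P = Q: the integral of min(lam * p, p)
    else:
        # lam * p <= q exactly when the projection x satisfies x >= t.
        t = math.log(lam) / delta + delta / 2.0
        alpha = lam * (1.0 - std_normal_cdf(t)) + std_normal_cdf(t - delta)
    return alpha, alpha / lam

# Setting of Figure 2: d = 64 and mu = 1 / sqrt(d), hence delta = 1.
d = 64
mu = 1.0 / math.sqrt(d)
delta = mu * math.sqrt(d)
for lam in [0.1, 0.5, 1.0, 2.0, 10.0]:
    alpha, beta = gaussian_shift_pr(lam, delta)
    print(f"lambda={lam:5.1f}  precision alpha={alpha:.4f}  recall beta={beta:.4f}")
```

Note that with $\mu = c/\sqrt d$ the norm of the shift is $c$ whatever the dimension, so under this reduction the ground-truth curve is the same for $d=64$ and $d=2048$; only the empirical estimates depend on $d$.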

Theorems & Definitions (6)

Definition 2.1: support and co-support
Proposition 2.2
Proof
Theorem 3.1
Proof (sketch)
Proof
