Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models

Dogyun Park; Suhyun Kim

TL;DR

The paper tackles unreliable evaluation of generative models by exposing core weaknesses in $k$NN-based fidelity and diversity metrics, particularly their sensitivity to outliers and insensitivity to distributional changes. It introduces PP&PR, a probabilistic framework built on a probabilistic scoring rule (PSR) that estimates how likely fake samples are to belong to the real support and vice versa. This yields two robust measures, $\text{P-precision}$ and $\text{P-recall}$, defined via $\text{PSR}_P(y_j)$ and $\text{PSR}_Q(x_i)$ with a fixed radius $R$ and distance-based probabilities. In toy experiments and benchmarks on real models (diffusion, StyleGAN, BigGAN) across multiple datasets, PP&PR is more reliable and stable than IP&IR and D&C under outliers and distributional shifts, and it reflects the fidelity-diversity balance more faithfully. The results suggest that PP&PR gives clearer insight into model quality and trade-offs, enabling more reliable model comparison. Three ingredients underpin the method's robustness and interpretability: $R$-based probabilistic subsupport estimation, identical-$R$ normalization, and $k$NN-guided hyperparameters.

Abstract

Assessing the fidelity and diversity of generative models is a difficult but important issue for technological advancement. Recent papers have therefore introduced $k$-Nearest Neighbor ($k$NN) based precision-recall metrics to break down the statistical distance into fidelity and diversity. While they provide an intuitive method, we thoroughly analyze these metrics and identify oversimplified assumptions and undesirable properties of $k$NN that result in unreliable evaluation, such as susceptibility to outliers and insensitivity to distributional changes. Thus, we propose novel metrics, P-precision and P-recall (PP&PR), based on a probabilistic approach that addresses these problems. Through extensive investigations on toy experiments and state-of-the-art generative models, we show that PP&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics. The code is available at https://github.com/kdst-team/Probablistic_precision_recall.
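To make the idea of a distance-based probabilistic score concrete, here is a minimal NumPy sketch of how P-precision and P-recall could be computed. It is an illustration under stated assumptions, not the authors' implementation (the linked repository is the reference for that). The assumptions are: a probability that decays linearly with distance and is zero beyond $R$, per-sample probabilities combined as "at least one subsupport contains the point", and $R$ set to a scaled mean $k$NN distance. The helper names (`pairwise_dist`, `knn_radius`, `psr`, `p_precision_recall`) and the defaults `k=4`, `a=1.2` are hypothetical choices made for this example.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_radius(x, k):
    """Mean distance from each point to its k-th nearest neighbour.

    Column 0 of the sorted distances is the point itself (distance 0),
    so column k is the k-th neighbour.
    """
    d_sorted = np.sort(pairwise_dist(x, x), axis=1)
    return d_sorted[:, k].mean()

def psr(queries, support, radius):
    """Assumed probabilistic score of each query against a support set.

    Each support point contributes a probability that falls linearly from 1
    (distance 0) to 0 (distance >= radius). The score is the probability that
    at least one support point "contains" the query.
    """
    d = pairwise_dist(queries, support)
    p = np.clip(1.0 - d / radius, 0.0, 1.0)
    return 1.0 - np.prod(1.0 - p, axis=1)

def p_precision_recall(real, fake, k=4, a=1.2):
    """Return (P-precision, P-recall) under the assumptions stated above."""
    r_real = a * knn_radius(real, k)  # one shared radius for all real samples
    r_fake = a * knn_radius(fake, k)  # one shared radius for all fake samples
    p_precision = psr(fake, real, r_real).mean()  # fake scored on real support
    p_recall = psr(real, fake, r_fake).mean()     # real scored on fake support
    return p_precision, p_recall
```

Because every sample receives a score in $[0, 1]$ rather than a binary in-or-out decision, a fake sample far from all real samples contributes close to zero even if it sits near a single outlier, which is the behaviour the paper argues for.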

Paper Structure (39 sections, 28 equations, 12 figures, 3 tables)

This paper contains 39 sections, 28 equations, 12 figures, 3 tables.

Introduction
Preliminary
Feature embedding for evaluation
$k$NN-based fidelity and diversity measures
Improved Precision and Recall (IP&IR).
Density and Coverage (D&C).
Limitations with scoring rules of IP&IR and D&C
Binary scoring rule.
Density scoring rule.
Coverage scoring rule.
Method
P-precision and P-recall (PP&PR)
Probabilistic scoring rule
Experiments
Toy experiments
...and 24 more sections

Figures (12)

Figure 1: Examples of IP&IR, D&C, and PP&PR (best viewed zoomed in). For simplicity, we use $k=2$ for $k$-Nearest Neighbor in all metrics. (a), (c): Because the outlier inflates the $k$NN radius and density is assumed constant within each hypersphere, I-precision, I-recall, and Density assign the same value to different $y$ over an overly large region (see $y_1$, $y_2$, and $y_3$ in (a)), leading to unreliably overestimated values. (b), (e): In contrast, our P-precision and P-recall assign different scores to different $y$ based on a probabilistic approach and address the overestimation of $k$NN (see the Method section for details). (d): A case in which Coverage exhibits its conceptual limitation. Even if the fake samples have lower diversity than the real samples, they can still be included in multiple hyperspheres of real samples, leading to high Coverage.
Figure 2: (a) Behavior of fidelity metrics between two Gaussian distributions $X \sim N(0,I)$ and $Y \sim N(u\mathbf{1},I)$ as $u$ moves over $[-3,3]$ with outlier $x_o \sim N(-\mathbf{2},I)$ added to $X$. (b) Behavior of diversity metrics between two Gaussian distributions $X \sim N(u\mathbf{1},I)$ and $Y \sim N(0,I)$ as $u$ moves over $[-3,3]$ with outlier $y_o \sim N(\mathbf{2}, I)$ added to $Y$. (c) Estimated bias from the presumed true value between two identical Gaussian distributions for different numbers of datasets $N$. The line shows the means and the shaded area denotes standard deviations across 50 runs.
Figure 3: Ablation over $k$. We measure each metric for different $k$ between $X\sim N(0,I)$ and $Y \sim N(u,I)$ for $u \in [-3.0,3.0]$ with outlier $x_o \sim N(-2,I)$ added to $X$.
Figure 4: Qualitative examples sorted according to $L$. We used StyleGAN [karras2019style] trained on FFHQ [karras2019style] and BigGAN [brock2018large] trained on CIFAR-10 [krizhevsky2012imagenet]. The top two rows are images with the highest $L$, meaning they have high PSR but low DSR. Conversely, the bottom two rows are images with the lowest $L$, indicating low PSR and high DSR.
Figure 5: Behavior of metrics between $X \sim N(0,I)$ and $Y \sim N(0, vI)$ as $v$ changes between [0.2, 1.5]. Because Density goes over 1 (up to nearly 1000), the $y$-axis for Density is on the right side of the plot for better visualization.
...and 7 more figures
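
The outlier effect described in the Figure 1 caption can be reproduced in a few lines. The sketch below implements the binary scoring rule of improved precision (a fake sample counts as covered if it lies inside the $k$NN hypersphere of at least one real sample) and shows how a single real outlier inflates the score. The data points and the function names `knn_radii` and `improved_precision` are made up for this illustration and are not taken from the paper.

```python
import numpy as np

def knn_radii(x, k):
    """Distance from each point in x to its k-th nearest neighbour."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Column 0 after sorting is the point itself, so column k is the k-th neighbour.
    return np.sort(d, axis=1)[:, k]

def improved_precision(real, fake, k):
    """Fraction of fake samples inside at least one real k-NN hypersphere."""
    radii = knn_radii(real, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    inside = (d <= radii[None, :]).any(axis=1)
    return inside.mean()

# Four tightly clustered real samples and three fake samples far from them.
real = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
fake = np.array([[6.0, 6.0], [7.0, 5.0], [5.0, 7.0]])

without_outlier = improved_precision(real, fake, k=2)  # 0.0

# Adding one real outlier creates a hypersphere of radius about 13.5 around it.
real_with_outlier = np.vstack([real, [[10.0, 10.0]]])
with_outlier = improved_precision(real_with_outlier, fake, k=2)  # 1.0

print(without_outlier, with_outlier)
```

With $k=2$, as in Figure 1, the outlier's second-nearest neighbour is roughly 13.5 units away, so its hypersphere swallows all three fake samples and precision jumps from 0 to 1 even though none of them is close to the bulk of the real data.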

Theorems & Definitions (1)

proof
