Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models
Dogyun Park, Suhyun Kim
TL;DR
The paper addresses unreliable evaluation of generative models by exposing core weaknesses in $k$-NN based fidelity and diversity metrics, in particular their sensitivity to outliers and their insensitivity to distributional changes. It introduces PP&PR, a probabilistic framework built on PSR, which estimates how likely each fake sample is to belong to the real support and vice versa. This yields two robust measures, P-precision and P-recall, defined via $\text{PSR}_P(y_j)$ and $\text{PSR}_Q(x_i)$ with a fixed radius $R$ and distance-based probabilities. In toy experiments and benchmarks on real models (diffusion, StyleGAN, BigGAN) across multiple datasets, PP&PR is more reliable and stable than IP&IR (improved precision and recall) and D&C (density and coverage) under outliers and distributional shifts, and it reflects the fidelity-diversity balance more faithfully. The results suggest that PP&PR gives clearer insight into model quality and trade-offs, which supports more reliable model comparison. The method's robustness and interpretability rest on three ingredients: $R$-based probabilistic subsupport estimation, normalization with an identical $R$, and $k$-NN guided hyperparameters.
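The idea of scoring each sample by a distance-based membership probability within a fixed radius $R$ can be illustrated with a small NumPy sketch. This is not the authors' implementation. The linear decay in `soft_membership`, the independence assumption used to combine per-sample probabilities in `psr`, and the choice to pass `radius` in by hand (rather than derive it from $k$-NN distances) are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def soft_membership(dist, radius):
    """Assumed membership probability: 1 at distance 0, decaying
    linearly to 0 at distance >= radius (an illustrative choice)."""
    return np.clip(1.0 - dist / radius, 0.0, 1.0)


def psr(queries, support, radius):
    """Probability that each query falls in the sub-support of at least
    one support sample, assuming the per-sample events are independent."""
    p = soft_membership(pairwise_distances(queries, support), radius)
    return 1.0 - np.prod(1.0 - p, axis=1)


def p_precision_recall(real, fake, radius):
    """Average the scores: fake samples against the real set (precision)
    and real samples against the fake set (recall), same radius for both."""
    p_precision = psr(fake, real, radius).mean()
    p_recall = psr(real, fake, radius).mean()
    return float(p_precision), float(p_recall)


# Toy check: the fake set covers only one of two real modes.
real = np.array([[0.0, 0.0], [10.0, 0.0]])
fake = np.array([[0.0, 0.0]])
print(p_precision_recall(real, fake, radius=2.0))  # (1.0, 0.5)
```

In this toy case the single fake sample sits exactly on a real sample, so precision is 1, while only one of the two real samples is covered, so recall drops to 0.5. Because membership decays smoothly with distance, a single distant outlier contributes little to either score, in contrast to a hard $k$-NN ball whose radius the outlier can inflate.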
Abstract
Assessing the fidelity and diversity of generative models is a difficult but important problem for technological advancement. Recent papers have therefore introduced $k$-Nearest Neighbor ($k$NN) based precision-recall metrics to break down the statistical distance into fidelity and diversity. While these metrics provide an intuitive method, we thoroughly analyze them and identify oversimplified assumptions and undesirable properties of $k$NN that result in unreliable evaluation, such as susceptibility to outliers and insensitivity to distributional changes. We therefore propose novel metrics, P-precision and P-recall (PP&PR), based on a probabilistic approach that addresses these problems. Through extensive investigations on toy experiments and state-of-the-art generative models, we show that PP&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics. The code is available at https://github.com/kdst-team/Probablistic_precision_recall.
