Grade Inflation in Generative Models
Phuc Nguyen, Miao Li, Alexandra Morgan, Rima Arnaout, Ramy Arnaout
TL;DR
The paper analyzes grade inflation in the quality scores used to compare two-dimensional distributions of real and synthetic data from generative models. It shows that equipoint scores, including the correlation score, earth-mover's distance (EMD), Jaccard (IoU), and Kullback-Leibler (KL) divergence, can rate even suboptimal fits too highly. It introduces Eden, an equidensity score linked to Rényi entropy of negative order. Eden avoids grade inflation and agrees better with human perception of goodness-of-fit in head-to-head judgments, as validated by 20 experts comparing 39 pairs of KDE plots. The findings suggest that equidensity scores offer more reliable evaluation for 2D, and potentially other low-dimensional, distribution comparisons, with implications for calibration, sampling, and extensions to higher dimensions.
Abstract
Generative models hold great potential, but only if one can trust the evaluation of the data they generate. We show that many commonly used quality scores for comparing two-dimensional distributions of synthetic vs. ground-truth data give better results than they should, a phenomenon we call the "grade inflation problem." We show that the correlation score, Jaccard score, earth-mover's score, and Kullback-Leibler (relative-entropy) score all suffer grade inflation. We propose that any score that values all datapoints equally, as these do, will also exhibit grade inflation; we refer to such scores as "equipoint" scores. We introduce the concept of "equidensity" scores, and present the Eden score, to our knowledge the first example of such a score. We found that Eden avoids grade inflation and agrees better with human perception of goodness-of-fit than the equipoint scores above. We propose that any reasonable equidensity score will avoid grade inflation. We identify a connection between equidensity scores and Rényi entropy of negative order. We conclude that equidensity scores are likely to outperform equipoint scores for generative models, and for comparing low-dimensional distributions more generally.
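For readers unfamiliar with the Rényi entropy mentioned above, the following is a minimal sketch of the standard textbook definition for a discrete distribution, H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha), evaluated at negative as well as positive orders. It is not the Eden score, and the abstract does not state how Eden is derived from this quantity. The function name `renyi_entropy` and the example distribution are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (in nats) of a discrete distribution p.

    Standard definition: H_alpha(p) = log(sum_i p_i**alpha) / (1 - alpha),
    with the Shannon entropy as the alpha -> 1 limit.
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()  # normalize so the entries sum to 1
    if alpha == 1:
        nz = p[p > 0]
        return float(-np.sum(nz * np.log(nz)))  # Shannon limit
    if alpha < 0 and np.any(p == 0):
        # p_i**alpha diverges for p_i == 0 when alpha is negative
        return float("inf")
    nz = p[p > 0]
    return float(np.log(np.sum(nz ** alpha)) / (1.0 - alpha))

# Hypothetical example: one dense bin and three sparse bins.
peaked = [0.97, 0.01, 0.01, 0.01]
for alpha in (-1, 0, 1, 2):
    print(alpha, renyi_entropy(peaked, alpha))
```

At positive orders the sum is dominated by the high-probability bin, whereas at negative orders the low-probability bins dominate. This is one way to see why a negative-order quantity is sensitive to sparse regions that a point-weighted score would largely ignore.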
