Grade Inflation in Generative Models

Phuc Nguyen, Miao Li, Alexandra Morgan, Rima Arnaout, Ramy Arnaout

TL;DR

The paper analyzes grade inflation in quality scores used to compare two-dimensional distributions of real versus synthetic data in generative models. It shows that equipoint scores including the correlation score, earth-mover's distance (EMD), Jaccard (IoU), and Kullback-Leibler (KL) divergence can produce inflated assessments even for suboptimal fits, while introducing Eden as an equidensity score linked to negative-order Rényi entropy. Eden avoids grade inflation and better matches human perceptual goodness-of-fit in head-to-head judgments, with validation from 20 experts comparing 39 pairs of KDE plots. The findings suggest equidensity scores offer more reliable evaluation for 2D and potentially broader low-dimensional distribution comparisons, with implications for calibration, sampling, and extensions to higher dimensions.

Abstract

Generative models hold great potential, but only if one can trust the evaluation of the data they generate. We show that many commonly used quality scores for comparing two-dimensional distributions of synthetic vs. ground-truth data give better results than they should, a phenomenon we call the "grade inflation problem." We show that the correlation score, Jaccard score, earth-mover's score, and Kullback-Leibler (relative-entropy) score all suffer grade inflation. We propose that any score that values all datapoints equally, as these do, will also exhibit grade inflation; we refer to such scores as "equipoint" scores. We introduce the concept of "equidensity" scores, and present the Eden score, to our knowledge the first example of such a score. We found that Eden avoids grade inflation and agrees better with human perception of goodness-of-fit than the equipoint scores above. We propose that any reasonable equidensity score will avoid grade inflation. We identify a connection between equidensity scores and Rényi entropy of negative order. We conclude that equidensity scores are likely to outperform equipoint scores for generative models, and for comparing low-dimensional distributions more generally.

Paper Structure (24 sections, 9 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: The grade inflation problem. a: Two distributions from Anscombe's quartet (Anscombe, 1973). Both have a Pearson's $R$ of 0.82, so their correlation score is 1.00 despite their differences (which are appreciable in their KDEs, right). Black lines show least-squares regression fits, illustrating the indistinguishable slopes and intercepts. b: A highly non-random distribution from the Datasaurus Dozen (Matejka and Fitzmaurice, 2017) and (Gaussian-distributed) synthetic data with the same means and standard deviations from an untrained generative model. Their Pearson's $R$ values are -0.06 and -0.11, respectively, resulting in a very high correlation score of 0.97 despite the poor fit.
  • Figure 2: Fits scored in Table 1. Left to right: real data (blue), generated synthetic data (yellow), and KDEs. a is considered a very low-quality fit; b-d are considered low-quality fits; e is considered a high-quality fit. Datasets, sizes, features, and models are as labeled. UCIMLR=UC Irvine Machine-Learning Repository. GC=Gaussian copula. nFlow=normalizing flow. ICA=independent component analysis.
  • Figure 3: The Eden score. The Eden score comparing the blue and yellow distributions in the high-quality fit in (a) is calculated as the mean intersection-over-union for each equidensity contour (ring, annulus) (b). Peaks, slopes, and foothills contribute equally. Top row, intersections; bottom row, unions (both in green). The ratios for the three contour levels shown are 0.77, 0.81, and 0.78 (left-to-right), which average to an Eden score of 0.79 for the fit of the two distributions in (a). For clarity, the score is calculated over three contours, instead of the five used in the rest of this study. c-d: Similar for a low-quality fit. The peaks are disjoint (ratio, 0.00), the slopes intersect by only a sliver (ratio, <0.01), and the foothills' intersection-over-union is 0.09, yielding an Eden score of 0.03.
  • Figure 4: Validation. 20 human experts were each asked to rate which of two fits was better for several pairs. Ratings were then compared against each of the five statistical scoring methods. Agreement between each human rater and the scoring method was measured by Cohen's $\kappa$. a: Percent of raters who agreed the most with each scoring method (Eden, KL, etc.). p-value is for Mann-Whitney U on the ranks. b: All $\kappa$ values for each score, with Mann-Whitney U p-values (Methods). c: Agreement between methods for test pairs (again measured by $\kappa$).
  • Figure 5: Oversampling affects scoring. a: Target (in-house "Stripes" dataset). b: Sample (orange) the same size as the target. Correlation, earth-mover's, Jaccard, KL, and Eden scores: 0.993, 0.941, 1.000, 0.996, and 0.452, respectively. c: Oversampling. Scores (same order): 0.998, 0.992, 1.000, 0.862, and 0.160. Eden is the most sensitive to the difference in KDEs between normal sampling (b) and oversampling (c).
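The grade inflation described in the Figure 1 caption can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the caption does not give the formula for the correlation score, so `correlation_score` assumes the form 1 - |R_real - R_synth| / 2, and the choice of sets I and II of Anscombe's quartet is also an assumption, since the caption does not say which two sets are plotted.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_score(r_real, r_synth):
    """ASSUMED definition (not stated in the caption): 1 - |difference in R| / 2."""
    return 1.0 - abs(r_real - r_synth) / 2.0


# Anscombe's quartet, sets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_set1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_set2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r1 = pearson_r(x, y_set1)
r2 = pearson_r(x, y_set2)
score = correlation_score(r1, r2)
print(round(r1, 2), round(r2, 2), round(score, 2))
```

Under the same assumed formula, the rounded values quoted for panel b (-0.06 and -0.11) give a score of about 0.975, which is in line with the reported 0.97 up to rounding of the underlying $R$ values. In both cases the score depends only on a single summary statistic, so two visibly different point clouds can score near-perfectly.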
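The Figure 3 caption describes the Eden score as the mean intersection-over-union across equidensity contours. The following is a minimal sketch of that idea, not the authors' implementation. It assumes both densities are already evaluated on the same grid, that contour edges are evenly spaced fractions of each density's own peak, and that a hypothetical `floor` cutoff bounds the outermost ring; the function names and these choices are illustrative and do not come from the paper.

```python
from math import exp


def band_masks(density, n_levels=3, floor=0.05):
    """Split a 2D density grid into n_levels equidensity bands (rings).

    Band edges are fractions of the grid's own peak value. `floor` is a
    hypothetical cutoff so the outermost band does not cover the whole grid.
    """
    peak = max(max(row) for row in density)
    edges = [floor + (1.0 - floor) * k / n_levels for k in range(n_levels + 1)]
    masks = []
    for k in range(n_levels):
        lo, hi = edges[k] * peak, edges[k + 1] * peak
        is_top = k == n_levels - 1  # the top band has no upper bound
        masks.append({
            (i, j)
            for i, row in enumerate(density)
            for j, value in enumerate(row)
            if value >= lo and (is_top or value < hi)
        })
    return masks


def eden_sketch(density_p, density_q, n_levels=3, floor=0.05):
    """Mean intersection-over-union of matching bands of two density grids."""
    ratios = []
    for band_p, band_q in zip(band_masks(density_p, n_levels, floor),
                              band_masks(density_q, n_levels, floor)):
        union = band_p | band_q
        ratios.append(len(band_p & band_q) / len(union) if union else 0.0)
    return sum(ratios) / len(ratios)


def gaussian_grid(size, cx, cy, sigma):
    """Toy isotropic Gaussian bump on a size x size grid (illustration only)."""
    return [[exp(-((i - cx) ** 2 + (j - cy) ** 2) / (2 * sigma ** 2))
             for j in range(size)] for i in range(size)]


same = eden_sketch(gaussian_grid(41, 20, 20, 4), gaussian_grid(41, 20, 20, 4))
shifted = eden_sketch(gaussian_grid(41, 20, 20, 4), gaussian_grid(41, 22, 20, 4))
disjoint = eden_sketch(gaussian_grid(41, 8, 8, 2), gaussian_grid(41, 32, 32, 2))
print(same, shifted, disjoint)
```

Because every band contributes one ratio to the mean, a mismatch in the peaks counts as much as a mismatch in the foothills, regardless of how many points fall in each. The caption's own numbers follow the same averaging: (0.77 + 0.81 + 0.78) / 3 is about 0.79 for the high-quality fit.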
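The validation in Figure 4 measures agreement between each human rater and each scoring method with Cohen's $\kappa$. As a reminder of what that statistic computes, here is a small sketch using the standard definition, $\kappa = (p_o - p_e) / (1 - p_e)$. The rater and method choices below are made-up illustrative labels, not data from the study.

```python
from collections import Counter


def cohens_kappa(choices_a, choices_b):
    """Cohen's kappa for two equal-length sequences of categorical choices."""
    if len(choices_a) != len(choices_b) or not choices_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(choices_a)
    # Observed agreement: fraction of items where both made the same choice.
    p_observed = sum(a == b for a, b in zip(choices_a, choices_b)) / n
    # Chance agreement: from each side's marginal label frequencies.
    counts_a, counts_b = Counter(choices_a), Counter(choices_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both sides used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: which fit of each pair ("left" or "right") was judged better.
human = ["left", "left", "right", "left", "right", "right"]
method = ["left", "right", "right", "left", "right", "left"]
kappa = cohens_kappa(human, method)
print(round(kappa, 3))
```

A $\kappa$ of 1 means the rater and the scoring method always picked the same fit, 0 means agreement no better than chance, and negative values mean systematic disagreement.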