Position: All Current Generative Fidelity and Diversity Metrics are Flawed

Ossi Räisä, Boris van Breugel, Mihaela van der Schaar

TL;DR

This work argues that all current fidelity and diversity metrics for synthetic data are flawed, hindering reliable evaluation of generative models. It introduces six desiderata to guide metric design and a comprehensive suite of automated sanity checks that reveal where existing metrics fail. By evaluating a broad set of metrics against these checks, the authors demonstrate pervasive shortcomings and argue against relying on absolute evaluations. They outline practical guidance for practitioners, emphasize the need for new metrics, and propose directions such as moving beyond Euclidean geometry to improve robustness and reliability in synthetic-data evaluation.

Abstract

Any method's development and practical application is limited by our ability to measure its reliability. The popularity of generative modeling emphasizes the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders practical use of synthetic data. Our aim is to convince the research community to spend more effort in developing metrics, instead of models. Additionally, through analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.

Paper Structure

This paper contains 78 sections, 19 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Total variation distance between a standard Gaussian and a Gaussian with mean $\mu 1_d$. The lower and upper bounds are based on Hellinger distance.
  • Figure 2: Total variation distance between a standard Gaussian and a Gaussian with covariance $\sigma^2 I_d$. The lower and upper bounds are based on Hellinger distance.
  • Figure 3: Samples of each component in the mode dropping and invention sanity check.
  • Figure 4: Samples from the sphere and torus distributions in the sphere vs. torus sanity check.
  • Figure 5: Comparison of pdfs for the real and synthetic distributions in the one vs. two modes check.
  • ...and 16 more figures
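
The captions of Figures 1 and 2 refer to lower and upper bounds on total variation (TV) distance based on Hellinger distance. The sketch below shows one way such bounds can be computed for the two Gaussian settings in those captions. It assumes the standard inequalities $H^2 \le \mathrm{TV} \le H\sqrt{2 - H^2}$, with $H^2 = 1 - \mathrm{BC}$ and $\mathrm{BC}$ the Bhattacharyya coefficient; the paper's exact bounds may differ. All function names are hypothetical and chosen for illustration only.

```python
import math


def hellinger_sq_mean_shift(mu: float, d: int) -> float:
    """Squared Hellinger distance between N(0, I_d) and N(mu * 1_d, I_d).

    Uses BC = exp(-||delta||^2 / 8) with ||delta||^2 = d * mu^2.
    """
    bc = math.exp(-d * mu**2 / 8.0)
    return 1.0 - bc


def hellinger_sq_scale(sigma: float, d: int) -> float:
    """Squared Hellinger distance between N(0, I_d) and N(0, sigma^2 I_d).

    Per dimension BC = sqrt(2 * sigma / (1 + sigma^2)); dimensions multiply.
    """
    bc = (2.0 * sigma / (1.0 + sigma**2)) ** (d / 2.0)
    return 1.0 - bc


def tv_bounds_from_hellinger(h2: float) -> tuple[float, float]:
    """Return (lower, upper) bounds on TV given squared Hellinger distance h2."""
    lower = h2
    upper = math.sqrt(h2 * (2.0 - h2))
    return lower, upper


def tv_exact_mean_shift(mu: float, d: int) -> float:
    """Exact TV between N(0, I_d) and N(mu * 1_d, I_d): erf(||delta|| / (2 * sqrt(2)))."""
    return math.erf(mu * math.sqrt(d) / (2.0 * math.sqrt(2.0)))


if __name__ == "__main__":
    # Illustrative values only; these are not taken from the paper.
    for d in (1, 4, 64):
        lo, hi = tv_bounds_from_hellinger(hellinger_sq_mean_shift(0.5, d))
        print(d, lo, tv_exact_mean_shift(0.5, d), hi)
```

For the mean-shift case the exact TV distance has a closed form, so it can be compared against the bounds directly. For the covariance-scaling case only the bounds are computed here, since this sketch does not attempt an exact expression.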