A Unifying Information-theoretic Perspective on Evaluating Generative Models

Alexis Fox, Samarth Swarup, Abhijin Adiga

TL;DR

The paper tackles the challenge of evaluating generative models by proposing a unifying information-theoretic framework for precision- and recall-based metrics. It introduces a tri-dimensional metric with three components, Precision Cross-Entropy ($PCE$), Recall Cross-Entropy ($RCE$), and Recall Entropy ($RE$), grounded in entropy and cross-entropy. The metric assesses fidelity as well as inter- and intra-class diversity, and it enables both population- and sample-level diagnostics. The authors connect existing precision/recall metrics to a general divergence view, derive empirical definitions via kNN estimators, and demonstrate through experiments that the new components correlate with human judgments and diagnose failure modes (mode invention, dropping, shrinkage) more effectively than one-dimensional metrics such as the Fréchet Distance (FD). The framework is domain-agnostic and supports diagnostic evaluation across data modalities, which makes it relevant for model selection and auditing. Code and reproducibility resources are provided to facilitate adoption.

Abstract

Considering the difficulty of interpreting generative model output, there is significant current research focused on determining meaningful evaluation metrics. Several recent approaches utilize "precision" and "recall," borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation), respectively. With the increase in metric proposals, there is a need for a unifying perspective, allowing for easier comparison and clearer explanation of their benefits and drawbacks. To this end, we unify a class of kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation. Additionally, we propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity and two distinct aspects of diversity, inter- and intra-class. Our domain-agnostic metric, derived from the information-theoretic concepts of entropy and cross-entropy, can be dissected for both sample- and mode-level analysis. Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics.
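
To make the kNN density estimation idea in the abstract concrete, the following is a minimal sketch of a generic plug-in kNN estimator for entropy and cross-entropy. It is not the paper's definition of PCE, RCE, or RE, which is not reproduced on this page. The function names (`knn_log_density`, `knn_cross_entropy`, `knn_entropy`), the brute-force distance computation, and the default `k=5` are all assumptions made for illustration. The sketch uses the standard kNN density estimate, $\hat{q}(x) = k / (N \, V_d \, r_k(x)^d)$, where $r_k(x)$ is the distance from $x$ to its kth nearest neighbor among $N$ reference samples and $V_d$ is the volume of the $d$-dimensional unit ball.

```python
import math
import numpy as np


def knn_distances(queries, reference, k, exclude_self=False):
    """Distance from each query point to its k-th nearest neighbor in `reference`."""
    # Brute-force pairwise Euclidean distances, shape (n_queries, n_reference).
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    if exclude_self:
        # Queries and reference are the same set: ignore each point's
        # zero distance to itself.
        np.fill_diagonal(dists, np.inf)
    # k-th smallest distance in every row.
    return np.partition(dists, k - 1, axis=1)[:, k - 1]


def knn_log_density(queries, reference, k, exclude_self=False):
    """Log of the plug-in kNN density estimate k / (N * V_d * r_k^d)."""
    n_ref, dim = reference.shape
    if exclude_self:
        n_ref -= 1  # each point has only N - 1 candidate neighbors
    r_k = knn_distances(queries, reference, k, exclude_self)
    # Log-volume of the unit ball in `dim` dimensions.
    log_unit_ball = (dim / 2) * math.log(math.pi) - math.lgamma(dim / 2 + 1)
    # Assumes no exact duplicate points, so r_k > 0.
    return math.log(k) - math.log(n_ref) - log_unit_ball - dim * np.log(r_k)


def knn_cross_entropy(samples_p, samples_q, k=5):
    """Estimate H(P, Q) = -E_{x ~ P}[log q(x)] from two sample sets."""
    return float(-np.mean(knn_log_density(samples_p, samples_q, k)))


def knn_entropy(samples_p, k=5):
    """Estimate H(P) = -E_{x ~ P}[log p(x)] from a single sample set."""
    log_p = knn_log_density(samples_p, samples_p, k, exclude_self=True)
    return float(-np.mean(log_p))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 2))             # stand-in for real features
    fake = rng.normal(loc=1.0, size=(500, 2))    # stand-in for generated features
    print("entropy of real sample:", knn_entropy(real))
    print("cross-entropy, real under fake:", knn_cross_entropy(real, fake))
    print("cross-entropy, fake under real:", knn_cross_entropy(fake, real))
```

Evaluating one sample set under the kNN density of the other, in each direction, is the general pattern behind cross-entropy-style fidelity and diversity scores. The plug-in form above omits the digamma bias correction used in the Kozachenko-Leonenko estimator, so its values carry a small, k-dependent bias. Because each per-point log-density is computed before averaging, the same quantities can be inspected per sample rather than only in aggregate.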

Paper Structure

This paper contains 18 sections, 35 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: We visualize the various failure modes (a)-(c) for a model distribution, where the real (green) distribution is composed of two modes: sports cars and pickup trucks. (a) represents a lack of generated pickup trucks, (b) and its corresponding examples show unrealistic points, and (c) depicts points clustered at the mode average.
  • Figure 2: Pearson correlation heatmap between the precision measures and human error rate, displayed separately for the CIFAR-10 and ImageNet datasets. The direction of the scores where higher values indicate poorer precision was flipped so that all metric scores rank in the same direction.
  • Figure 3: Behavior of metrics at different classifier-free guidance (CFG) values for the generation of samples using the DiT-XL-2 model trained on ImageNet. Two sets of 15 random classes were used (marked (1) and (2)), with six runs each performed and averaged. Error bars (±1 SD) are plotted but are visibly negligible. An arrow is placed beside each measure's lines: an upward arrow means that higher values indicate better fidelity or diversity, and a downward arrow means the opposite. See sample images in the Appendix and Figure 1(c).
  • Figure 4: 100 random classes were dropped at a time from the ImageNet ADMG-ADMU generated dataset. The corresponding metric scores are plotted. The scores of 10 runs were averaged and (±1 SD) error bars are plotted, though visually minimal. Similar to Figure 3, arrows are beside each measure's lines, with the orientation of the arrow corresponding to the direction in which better fidelity/diversity is defined.
  • Figure 5: Two-dimensional visualization of the sample set of distributions used for calculations in Figure 6.
  • ...and 2 more figures