FLD+: Data-efficient Evaluation Metric for Generative Models

Pranav Jeevan, Neeraj Nixon, Amit Sethi

TL;DR

The proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size.

Abstract

We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than previous metrics such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allow the computation of the density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flows can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID, which needs large samples of real and generated images to reliably compute the Fréchet distance between their features. We made FLD+ even more computationally efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics, such as InceptionNetV3 pre-trained on ImageNet.

Paper Structure

This paper contains 17 sections, 7 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: The visual interpretation of the FLD+ metric. The real data distribution (blue curve) is modeled by the flow model, and the generated image distribution (dotted red curve) is not modeled by it. For generated images, most data points will fall within high-likelihood regions of the generated distribution, as shown by points $g_1$ and $g_2$. In scenarios where the real and generated distributions are closely aligned (top figure), the likelihood of generated images with respect to the real distribution, denoted as $L_g$, will be high, nearly matching the likelihood of real images, $L_r$, reflecting strong distributional similarity. When averaging over all generated images, the average likelihood is correspondingly higher. Conversely, when the real and generated distributions are dissimilar or more distant (bottom figure), the likelihood of generated images $L_g$ relative to the real distribution significantly decreases. As we average over all generated images in this case, the result is a notably lower overall likelihood compared to the case of aligned distributions, illustrating the increased distance between them and its impact on likelihood evaluation. Therefore, the likelihood of generated images with respect to the real data distribution serves as an effective metric for assessing the distance between two data distributions.
  • Figure 2: To compute FLD+, we start by training the flow model on real images. In the training phase, real images are processed through a frozen pre-trained vision backbone, where activations from the penultimate layer undergo average pooling and are then flattened before being passed to the normalizing flow. During the evaluation phase, both real and generated images are input into the flow model. Their log-likelihoods are calculated, averaged separately, and then the ratio of these averages is computed. Finally, this ratio is exponentiated to obtain the FLD+ metric.
  • Figure 3: Monotonicity and robustness of FLD+ with increasing levels of Gaussian noise applied to images are shown (top). Images with progressively higher levels of Gaussian noise, displayed from left to right, are shown below with corresponding noise values, $\alpha$, indicated beneath each image (bottom).
  • Figure 4: Monotonicity and robustness of FLD+ with increasing levels of Gaussian blur applied to images are shown (top). Images with progressively higher levels of Gaussian blur, displayed from left to right, have corresponding values of kernel size used for creating the blur, indicated below each image (bottom).
  • Figure 5: Monotonicity and robustness of FLD+ with increasing levels of salt-and-pepper noise applied to images are shown (top). Images with progressively higher levels of salt-and-pepper noise, displayed from left to right, have corresponding probability values that control the proportion of pixels that will be corrupted, indicated below each image (bottom).
  • ...and 5 more figures
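
The evaluation step described in the Figure 2 caption (average the log-likelihoods of real and generated images separately, take the ratio of the averages, then exponentiate) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the per-image log-likelihoods have already been produced by a trained flow model, and it assumes the ratio is taken as generated over real, which the caption does not specify. The function name `fld_plus_from_loglik` is hypothetical.

```python
import math
from statistics import fmean


def fld_plus_from_loglik(loglik_real, loglik_gen):
    """Exponentiated ratio of average log-likelihoods.

    loglik_real, loglik_gen: per-image log-likelihoods under a flow model
    trained on real images (assumed to be computed elsewhere).
    ASSUMPTION: the ratio is mean(generated) / mean(real); the caption
    only says "the ratio of these averages".
    """
    mean_real = fmean(loglik_real)  # average log-likelihood of real images
    mean_gen = fmean(loglik_gen)    # average log-likelihood of generated images
    if mean_real == 0.0:
        raise ValueError("average real log-likelihood is zero; ratio undefined")
    return math.exp(mean_gen / mean_real)


# Toy, made-up log-likelihood values purely for illustration.
real_ll = [-2.0, -2.2, -1.8]
close_ll = [-2.1, -2.0, -1.9]  # generated set that the flow scores like real data
far_ll = [-4.0, -4.4, -3.6]    # generated set that the flow scores as unlikely

print(fld_plus_from_loglik(real_ll, close_ll))
print(fld_plus_from_loglik(real_ll, far_ll))
```

Under this assumed ordering, and with negative average log-likelihoods as in the toy values, a generated set with lower likelihood under the real-data flow yields a larger value, which matches the reading of the metric as a distance.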