Normalizing Flow-Based Metric for Image Generation

Pranav Jeevan, Neeraj Nixon, Amit Sethi

Abstract

We propose two new evaluation metrics, based on normalizing flows, to assess the realness of generated images: a simpler and more efficient flow-based likelihood distance (FLD) and a more exact dual-flow-based likelihood distance (D-FLD). Because normalizing flows can be used to compute the exact likelihood, the proposed metrics assess how closely generated images align with the distribution of real images from a given domain. This property gives the proposed metrics a few advantages over the widely used Fréchet inception distance (FID) and other recent metrics. Firstly, the proposed metrics need only a few hundred images to stabilize (converge in mean), as opposed to the tens of thousands needed for FID and at least a few thousand for the other metrics. This allows confident evaluation of even small sets of generated images, such as validation batches inside training loops. Secondly, the network used to compute the proposed metrics has over an order of magnitude fewer parameters than the Inception-V3 network used to compute FID, making it computationally more efficient. To assess the realness of generated images in new domains (e.g., x-ray images), these networks should ideally be retrained on real images to model their distinct distributions. Thus, our smaller network will be even more advantageous for new domains. Extensive experiments show that the proposed metrics have the desired monotonic relationships with the extent of image degradation of various kinds.

Paper Structure

This paper contains 19 sections, 7 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Mean values of FLD and FID as a function of the number of generated images. FLD has much better sample efficiency: it achieves reliable results with fewer than 200 samples, whereas FID requires over 20,000 samples to reach a reliable mean score.
  • Figure 2: Visual interpretation of the D-FLD metric. The figure shows two pairs of probability density functions estimated using normalizing flows. The blue curve is the density of the flow trained on real data, and the red curve is the density of the flow trained on generated data. In the top graph, the two distributions are relatively similar, so the differences in likelihood, $d_1$ and $d_2$, for two images $x_1$ and $x_2$ are small. In the bottom graph, the distributions are more dissimilar, so $d_1$ and $d_2$ are larger. As the dissimilarity between the distributions increases, the likelihood difference averaged over the image space also increases. Thus, the difference in likelihood can be used as a metric for the distance between two data distributions.
  • Figure 3: The process of computing D-FLD using two normalizing flows. In the training phase, two normalizing flows are independently trained on real and generated images. In the evaluation phase, each image is passed through both flows; the absolute difference between the two flows' log-likelihoods is computed, averaged over all images, and then transformed to produce the final metric.
  • Figure 4: Visual interpretation of the FLD metric. The figure shows the real image distribution estimated by a normalizing flow (blue) and the generated image distribution (dotted red), which is not modeled by a normalizing flow. Most generated images lie in regions of high likelihood under the generated distribution, as indicated by points $x_1$ and $x_2$. In the top figure, the real and generated distributions are closely aligned, so the likelihoods of the generated images with respect to the real data distribution, $L_1$ and $L_2$, are high, and so is their average over all generated images. In the bottom figure, the real and generated distributions are further apart, so the likelihoods of the generated images with respect to the real distribution are significantly lower, and their average is much lower than in the closely aligned case. Hence, the likelihood of generated images with respect to the real data distribution can be used as a metric for the distance between two data distributions.
  • Figure 5: The process of computing FLD using a single normalizing flow. In the training phase, a normalizing flow is trained on real images. In the evaluation phase, real and generated images are passed through the flow; their log-likelihoods are computed and averaged separately, and the ratio of the two averages is the final metric.
  • ...and 11 more figures
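
The evaluation phases described in the captions of Figures 3 and 5 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the per-image log-likelihoods have already been computed by trained normalizing flows and are passed in as plain arrays. The function names `fld_ratio` and `dfld_mean_abs_diff` are hypothetical. The direction of the FLD ratio (generated over real) is an assumption, since the caption only says "their ratio". The final transformation applied to the D-FLD average is not specified in the captions, so it is left out.

```python
import numpy as np


def fld_ratio(real_loglik, gen_loglik):
    """Ratio of mean log-likelihoods under a single flow trained on real images.

    Both inputs are per-image log-likelihoods from the same flow.
    ASSUMPTION: the ratio is taken as generated / real; the caption of
    Figure 5 does not state the direction.
    """
    real_mean = float(np.mean(np.asarray(real_loglik, dtype=float)))
    gen_mean = float(np.mean(np.asarray(gen_loglik, dtype=float)))
    return gen_mean / real_mean


def dfld_mean_abs_diff(loglik_real_flow, loglik_gen_flow):
    """Mean absolute difference between two flows' log-likelihoods.

    Each position holds the log-likelihoods of the same image under the flow
    trained on real images and the flow trained on generated images.
    ASSUMPTION: the final transformation mentioned in the caption of
    Figure 3 is omitted because it is not given there.
    """
    a = np.asarray(loglik_real_flow, dtype=float)
    b = np.asarray(loglik_gen_flow, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both flows must score the same set of images")
    return float(np.mean(np.abs(a - b)))


# Toy values standing in for flow outputs (illustrative only, not real data).
toy_fld = fld_ratio([-2.0, -4.0], [-6.0, -6.0])
toy_dfld = dfld_mean_abs_diff([-1.0, -3.0], [-2.0, -1.0])
print(toy_fld, toy_dfld)
```

Taking log-likelihood arrays as input keeps the sketch independent of any particular flow architecture or library. In practice the arrays would come from evaluating the trained flow or flows on each image.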