Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, Sanjiv Kumar

TL;DR

The paper challenges the use of Fréchet Inception Distance (FID) as the primary metric for evaluating image generation quality, arguing that its normality assumptions and reliance on Inception features lead to misrankings and poor sensitivity to distortions. It introduces CMMD, a distribution-free, unbiased metric based on CLIP embeddings and Maximum Mean Discrepancy with a Gaussian kernel, designed to be more aligned with human judgments and to require fewer samples. Through extensive experiments on progressive image generation, complex distortions, and human evaluations, CMMD consistently tracks true image quality improvements and degradation better than FID. This work provides a practical, scalable alternative for robust online evaluation and comparison of modern text-to-image models.

Abstract

As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between the distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect the gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when the sample size varies. We also propose an alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
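
To make the abstract's description of CMMD concrete, the following is a minimal sketch of the standard unbiased estimator of squared maximum mean discrepancy with a Gaussian RBF kernel. It is not the authors' implementation. The function names `rbf_kernel` and `mmd2_unbiased`, the default bandwidth `sigma=1.0`, and the use of plain NumPy arrays in place of real CLIP embeddings are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(a * a, axis=1)[:, None]
        + np.sum(b * b, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2_unbiased(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Unbiased estimate of squared MMD between two sets of embeddings.

    x has shape (m, d) and y has shape (n, d). The bandwidth sigma is a free
    parameter here (hypothetical default, not a value taken from the paper).
    """
    m, n = x.shape[0], y.shape[0]
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Within-set sums exclude the diagonal (i == j); this is what makes the
    # estimator unbiased.
    sum_xx = k_xx.sum() - np.trace(k_xx)
    sum_yy = k_yy.sum() - np.trace(k_yy)
    return float(
        sum_xx / (m * (m - 1)) + sum_yy / (n * (n - 1)) - 2.0 * k_xy.mean()
    )
```

In use, `x` and `y` would hold embeddings of real and generated images respectively. Because the estimator is unbiased, it can come out slightly negative when the two sets are drawn from the same distribution. Note also that nothing in the computation assumes the embeddings are Gaussian, in contrast to the normality assumption the abstract criticizes in FID.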

Paper Structure

This paper contains 16 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Behavior of FID and CMMD under distortions. CMMD increases monotonically with the distortion level, correctly identifying the degradation in image quality as distortions increase. FID behaves incorrectly: it improves (goes down) for the first few distortion levels, suggesting that quality improves when these more subtle distortions are applied. See the section on VQGAN distortions for details.
  • Figure 2: t-SNE visualization of Inception embeddings of the COCO 30K dataset. Note that even in the reduced-dimensional 2D representation, it is easy to identify that embeddings have multiple modes and do not follow a multivariate normal distribution.
  • Figure 3: The quality of the generated image monotonically improves as we progress through Muse's refinement iterations. CMMD correctly identifies the improvements. FID, however, incorrectly indicates a quality degradation (see Figure 4). Prompt: "The Parthenon".
  • Figure 4: Behavior of FID and CMMD for Muse steps. CMMD decreases monotonically, correctly identifying the iterative improvements made to the images (see Figure 3). FID is completely wrong, suggesting degradation in image quality as iterations progress. $\text{FID}_\infty$ has the same behavior as FID.
  • Figure 5: Behavior of FID and CMMD under distortions. Images in the first row (FID: 21.40, CMMD: 0.721) are undistorted. Images in the second row (FID: 18.02, CMMD: 1.190) are distorted by randomly replacing each VQGAN token with probability $p=0.2$. The image quality clearly degrades as a result of the distortion, yet FID improves (goes down), while CMMD correctly identifies the degradation.
  • ...and 3 more figures
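
The token-replacement distortion described in the Figure 5 caption can be sketched as follows. This is an illustrative sketch only: the caption does not say how replacement tokens are chosen, so drawing them uniformly from the codebook is an assumption, as are the function name `replace_tokens`, the codebook size of 1024, and the 16×16 token grid. Decoding the distorted tokens back into images requires a VQGAN decoder, which is not shown.

```python
import numpy as np


def replace_tokens(
    tokens: np.ndarray,
    p: float,
    codebook_size: int,
    rng: np.random.Generator,
) -> np.ndarray:
    """Independently replace each token with probability p.

    Replacement tokens are drawn uniformly from [0, codebook_size), which is
    an assumption; a replacement can coincide with the original token.
    """
    mask = rng.random(tokens.shape) < p  # True where a token gets replaced
    random_tokens = rng.integers(0, codebook_size, size=tokens.shape)
    return np.where(mask, random_tokens, tokens)


# Hypothetical usage: a batch of 64 images, each encoded as a 16x16 token grid.
rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(64, 16, 16))
distorted = replace_tokens(tokens, p=0.2, codebook_size=1024, rng=rng)
```

Sweeping `p` over a range of values would produce a series of distortion levels like the one the Figure 1 caption refers to.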