Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Ge Ya Luo, Gian Mario Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal

TL;DR

JEDi, the JEPA Embedding Distance, is a proposed metric based on features derived from a Joint Embedding Predictive Architecture and measured using Maximum Mean Discrepancy with a polynomial kernel; the paper reports clear evidence that it is a superior alternative to the widely used FVD metric.

Abstract

The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with a polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.
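
To make the metric named in the abstract concrete, the following is a minimal sketch of a squared Maximum Mean Discrepancy estimate with a polynomial kernel, computed over two sets of pre-extracted video features. The function names, the unbiased estimator, and the kernel settings (degree 3, scale 1/d, offset 1) are assumptions chosen for illustration; the abstract does not specify them, and the feature extractor (for example a V-JEPA encoder) is out of scope here.

```python
import numpy as np


def polynomial_kernel(x, y, degree=3, gamma=None, coef0=1.0):
    """Pairwise k(x, y) = (gamma * <x, y> + coef0) ** degree."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default: 1 / feature dimension
    return (gamma * (x @ y.T) + coef0) ** degree


def mmd2_poly(real_feats, gen_feats, **kernel_kwargs):
    """Unbiased estimate of squared MMD between arrays of shape (n, d) and (m, d)."""
    n, m = len(real_feats), len(gen_feats)
    k_xx = polynomial_kernel(real_feats, real_feats, **kernel_kwargs)
    k_yy = polynomial_kernel(gen_feats, gen_feats, **kernel_kwargs)
    k_xy = polynomial_kernel(real_feats, gen_feats, **kernel_kwargs)
    # Exclude the diagonal so each within-set average uses distinct pairs only.
    mean_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    mean_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return float(mean_xx + mean_yy - 2.0 * k_xy.mean())


if __name__ == "__main__":
    # Synthetic stand-ins for video features; real use would load encoder outputs.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 16))
    generated = rng.normal(loc=1.0, size=(200, 16))
    print(mmd2_poly(real, generated))
```

Unlike the Fréchet distance, this estimate does not fit a Gaussian to the features, which is relevant to limitation (1) above. The unbiased form can return slightly negative values when the two sets come from the same distribution.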

Paper Structure

This paper contains 39 sections, 11 equations, 22 figures, 4 tables, and 2 algorithms.

Figures (22)

  • Figure 1: The number of samples that Fréchet Distance (FD), Energy, and $\text{MMD}_{\text{POLY}}$ need to converge, compared against their alignment with human evaluation on the UCF-101 dataset. JEDi, the feature space of a V-JEPA model ($\text{V-JEPA}_{\text{SSv2}}$) in combination with a Maximum Mean Discrepancy (MMD) metric, is a vastly more efficient framework for evaluating distributions of generated videos than conventional methods. The current standard, FVD (FD+I3D), underperforms in terms of both sample efficiency and alignment with human evaluation.
  • Figure 2: The dimensionally reduced video features of the 11 datasets using LDA and PCA indicate that the video features are non-Gaussian in the combined dataset space. While individual dataset clusters may appear Gaussian in these plots, the low explained variance ratios (0.134-0.231) of the PCA-reduced spaces suggest that 2D projections in these plots may not capture the complexity of higher-dimensional feature distributions within individual datasets. Figures \ref{fig:pca_individual} and \ref{fig:pca_individual_vjepa_pt} contain dataset-specific LDA and PCA plots, which reveal non-Gaussian characteristics within the datasets.
  • Figure 3: The number of samples needed to come within a 5% error margin of the distance measured from 5,000 samples, using the training and testing sets of UCF-101. An "_ae" suffix indicates that the feature space has been compressed using an autoencoder. We assess the number of samples required for convergence at 100-sample intervals. Convergence at sample size $N$ is achieved if: (1) the average metric value from 5 repeated samplings of $N$ features falls within a 5% error margin, and (2) all subsequent interval evaluations maintain an average metric value within the 5% error margin. $\text{VideoMAE}_{\text{PT}}$ and $\text{V-JEPA}_{\text{PT}}$ results are in the Appendix (Figure \ref{fig:number_sample_convergence_others}). We find that Fréchet Distance (FD) converges slowest, while $\text{MMD}_{\text{POLY}}$ shows the highest sample efficiency.
  • Figure 4: How metric distance changes as temporal blur increases. Specifically, temporal blur distortion is controlled by varying the sigma range ($\sigma$) using the distortion level ($\lambda$), with $\sigma=[0.1-0.01\lambda, 0.75+0.8\lambda]$. The study is carried out on the UCF-101 dataset.
  • Figure 5: Ctrl-V is fine-tuned on BDD. Visual inspection shows incremental improvements in generation quality at each training step. This is captured by JEDi ($\text{V-JEPA}_{\text{SSv2}}$+$\text{MMD}_{\text{POLY}}$). However, FVD (I3D+FD), $\text{VideoMAE}_{\text{SSv2}}$+$\text{MMD}_{\text{POLY}}$ and $\text{V-JEPA}_{\text{PT}}$+$\text{MMD}_{\text{POLY}}$ fail to detect incremental improvements. The Spearman correlation coefficients between the X and Y axes are -1, -0.6, -0.9 and -0.8 for JEDi, FVD, $\text{VideoMAE}_{\text{SSv2}}$+$\text{MMD}_{\text{POLY}}$ and $\text{V-JEPA}_{\text{PT}}$+$\text{MMD}_{\text{POLY}}$, respectively, with only JEDi showing statistical significance.
  • ...and 17 more figures
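
The convergence criterion stated in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `samples_to_converge`, the `metric_fn` callable, and the choice to use all available features as the reference (the caption uses 5,000 samples) are assumptions.

```python
import numpy as np


def samples_to_converge(metric_fn, feats_a, feats_b, step=100, repeats=5,
                        tol=0.05, seed=0):
    """Smallest N (a multiple of `step`) from which the averaged metric stays
    within `tol` of the reference value; None if the last interval fails."""
    rng = np.random.default_rng(seed)
    n_max = min(len(feats_a), len(feats_b))
    # Assumption: the reference is the metric on the first n_max features.
    reference = metric_fn(feats_a[:n_max], feats_b[:n_max])
    sizes = list(range(step, n_max + 1, step))
    within = []
    for n in sizes:
        values = []
        for _ in range(repeats):
            idx_a = rng.choice(len(feats_a), size=n, replace=False)
            idx_b = rng.choice(len(feats_b), size=n, replace=False)
            values.append(metric_fn(feats_a[idx_a], feats_b[idx_b]))
        # Condition (1): the mean over repeated samplings is within the margin.
        within.append(abs(np.mean(values) - reference) <= tol * abs(reference))
    # Condition (2): every later interval must also pass, so walk backwards
    # and keep the start of the trailing run of passing intervals.
    converged_at = None
    for n, ok in zip(reversed(sizes), reversed(within)):
        if not ok:
            break
        converged_at = n
    return converged_at


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.normal(size=(2000, 8))
    b = rng.normal(loc=2.0, size=(2000, 8))

    def mean_gap(x, y):
        # Toy metric: squared distance between the two feature means.
        return float(np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2))

    print(samples_to_converge(mean_gap, a, b))
```

Any callable that maps two feature arrays to a scalar distance can be passed as `metric_fn`, so the same loop applies to each metric and feature space being compared.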