Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Ge Ya Luo; Gian Mario Favero; Zhi Hao Luo; Alexia Jolicoeur-Martineau; Christopher Pal

TL;DR

JEDi, the JEPA Embedding Distance, is a proposed metric based on features derived from a Joint Embedding Predictive Architecture and measured using Maximum Mean Discrepancy with a polynomial kernel. The paper reports clear evidence that it is a superior alternative to the widely used FVD metric.

Abstract

The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating the distribution quality of generated videos. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture and measured using Maximum Mean Discrepancy with a polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that JEDi is a superior alternative to the widely used FVD metric: it requires only 16% of the samples to reach its steady value while increasing alignment with human evaluation by 34%, on average.
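To make the distance in the abstract concrete, the following is a minimal sketch of an unbiased squared-MMD estimate with a polynomial kernel, computed on two sets of precomputed feature vectors. The kernel settings (degree 3, scale 1/d, offset 1) are assumptions borrowed from common practice and are not confirmed by this page; the function names `polynomial_kernel` and `mmd_poly` are hypothetical, and feature extraction with a V-JEPA model is not shown.

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, gamma=None, coef0=1.0):
    """k(x, y) = (gamma * <x, y> + coef0) ** degree.

    Assumption: degree=3, gamma=1/d, coef0=1 are illustrative defaults,
    not values taken from the paper.
    """
    if gamma is None:
        gamma = 1.0 / x.shape[1]
    return (gamma * (x @ y.T) + coef0) ** degree

def mmd_poly(real, fake, **kernel_args):
    """Unbiased estimate of squared MMD between two feature sets.

    real: array of shape (m, d), fake: array of shape (n, d).
    """
    m, n = len(real), len(fake)
    k_rr = polynomial_kernel(real, real, **kernel_args)
    k_ff = polynomial_kernel(fake, fake, **kernel_args)
    k_rf = polynomial_kernel(real, fake, **kernel_args)
    # Within-set terms exclude the diagonal (each sample paired with itself).
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    # The cross term uses every real/fake pair.
    term_rf = k_rf.mean()
    return term_rr + term_ff - 2.0 * term_rf

# Synthetic stand-ins for video features (not real model embeddings).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
same = rng.normal(size=(200, 16))
shifted = rng.normal(loc=1.0, size=(200, 16))

print("same distribution:   ", mmd_poly(real, same))
print("shifted distribution:", mmd_poly(real, shifted))
```

On synthetic data like this, the estimate stays near zero when both sets come from the same distribution and grows when one set is shifted. Unlike the Fréchet distance, the estimate needs no Gaussian fit to the features.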

Paper Structure (39 sections, 11 equations, 22 figures, 4 tables, 2 algorithms)

Introduction
Background and Notations
Video Feature Representation
Fréchet Distance and Fréchet Video Distance
Other Distribution Distance Metrics
Examining FVD: Feature Spaces and the Gaussianity Assumption
The Dual Challenge of Convergence: High-Dimensional Feature Spaces and Limited Samples
Challenge #1: The Curse of Dimensionality
Challenge #2: Sample Efficiency and Data Scarcity
Metric Distance Analysis: Noise, Generative Models, and Human Study
Noise & Generation Models and Their Impacts on Metric Measurement
Metric Robustness Assessment With Progressive Distortion Level and Training Duration
Sample Efficiency Under Noise Distortion
Human Evaluation
Conclusion
...and 24 more sections

Figures (22)

Figure 1: Comparing the number of samples that Fréchet Distance (FD), Energy, and $\text{MMD}_{\text{POLY}}$ need to converge against their alignment with human evaluation on the UCF-101 dataset. JEDi, the feature space of a V-JEPA model ($\text{V-JEPA}_{\text{SSv2}}$) in combination with a Maximum Mean Discrepancy (MMD) metric, is a vastly more efficient framework for evaluating distributions of generated videos than conventional methods. The current standard, FVD (FD+I3D), underperforms in terms of both sample efficiency and alignment with human evaluation.
Figure 2: The dimensionally reduced video features of the 11 datasets using LDA and PCA indicate that the video features are non-Gaussian in the combined dataset space. While individual dataset clusters may appear Gaussian in these plots, the low explained variance ratios (0.134-0.231) of the PCA-reduced spaces suggest that 2D projections in these plots may not capture the complexity of higher-dimensional feature distributions within individual datasets. Figures \ref{fig:pca_individual} and \ref{fig:pca_individual_vjepa_pt} contain dataset-specific LDA and PCA plots, which reveal non-Gaussian characteristics within the datasets.
Figure 3: The number of samples needed to achieve a 5% error margin of the distance measured from 5,000 samples using the training and testing sets of UCF-101. An "_ae" suffix indicates that the feature space has been compressed using an autoencoder. We assess the number of samples required for convergence at 100-sample intervals. Convergence at sample size $N$ is achieved if: (1) the average metric value from 5 repeated samplings of $N$ features falls within a 5% error margin, and (2) all subsequent interval evaluations maintain an average metric value within the 5% error margin. $\text{VideoMAE}_{\text{PT}}$ and $\text{V-JEPA}_{\text{PT}}$ results are in the Appendix (Figure \ref{fig:number_sample_convergence_others}). We find that Fréchet Distance (FD) converges the slowest, while $\text{MMD}_{\text{POLY}}$ shows the highest sample efficiency.
Figure 4: How metric distance changes as temporal blur increases. Specifically, temporal blur distortion is controlled by varying the sigma range ($\sigma$) using the distortion level ($\lambda$), with $\sigma=[0.1-0.01\lambda, 0.75+0.8\lambda]$. The study is carried out on the UCF-101 dataset.
Figure 5: Ctrl-V is fine-tuned on BDD. Visual inspection shows incremental improvements in generation quality at each training step. This is captured by JEDi ($\text{V-JEPA}_{\text{SSv2}}$+$\text{MMD}_{\text{POLY}}$). However, FVD (I3D+FD), $\text{VideoMAE}_{\text{SSv2}}$+$\text{MMD}_{\text{POLY}}$ and $\text{V-JEPA}_{\text{PT}}$+$\text{MMD}_{\text{POLY}}$ fail to detect incremental improvements. The Spearman correlation coefficients between the X and Y axes are -1, -0.6, -0.9 and -0.8 for JEDi, FVD, $\text{VideoMAE}_{\text{SSv2}}$+$\text{MMD}_{\text{POLY}}$ and $\text{V-JEPA}_{\text{PT}}$+$\text{MMD}_{\text{POLY}}$, respectively, with only JEDi showing statistical significance.
...and 17 more figures
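
The convergence criterion described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `samples_to_converge` is hypothetical, the input is assumed to already hold the metric averaged over the 5 repeated samplings at each sample size, and the reference is assumed to be the value measured from 5,000 samples.

```python
def samples_to_converge(avg_by_n, reference, tol=0.05):
    """Return the smallest evaluated N such that the averaged metric at N
    and at every larger evaluated N stays within `tol` (relative) of
    `reference`. Returns None if the largest N is itself out of tolerance.

    avg_by_n: dict mapping sample size N -> metric averaged over repeats.
    """
    converged_at = None
    # Walk from the largest sample size down; stop at the first violation,
    # so that condition (2), all subsequent intervals in margin, holds.
    for n in sorted(avg_by_n, reverse=True):
        if abs(avg_by_n[n] - reference) <= tol * abs(reference):
            converged_at = n
        else:
            break
    return converged_at

# Made-up averages at 100-sample intervals, for illustration only.
averages = {100: 2.0, 200: 1.04, 300: 1.2, 400: 1.03, 500: 0.99}
print(samples_to_converge(averages, reference=1.0))
```

In this made-up example, N=200 falls inside the margin but N=300 does not, so N=200 fails condition (2) and the sketch reports 400.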
