On the Content Bias in Fréchet Video Distance

Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang

TL;DR

The paper analyzes why Fréchet Video Distance (FVD) can favor high frame quality over temporal realism, identifying a content bias arising from the features used to compute FVD. By decoupling spatial and temporal quality and by probing the metric's perceptual null space with resampling, the authors show FVD is largely insensitive to temporal artifacts when using traditional I3D features. Replacing these with self-supervised VideoMAE-v2 features substantially mitigates the bias, improving alignment with human perception, especially for motion. The work highlights a need for better video evaluation metrics and demonstrates practical gains from adopting self-supervised features in FVD computations, with implications for long-video generation and out-of-domain content.

Abstract

Fréchet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to conflict with human perception occasionally. In this paper, we aim to explore the extent of FVD's bias toward per-frame quality over temporal realism and identify its sources. We first quantify the FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that the FVD increases only slightly with large temporal corruption. We then analyze the generated videos and show that via careful sampling from a large set of generated videos that do not contain motions, one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias towards the quality of individual frames. We further observe that the bias can be attributed to the features extracted from a supervised video classifier trained on the content-biased dataset. We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality. Finally, we revisit a few real-world examples to validate our hypothesis.
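For readers unfamiliar with how a Fréchet-style metric is computed, here is a minimal sketch of the Fréchet distance between two Gaussians fitted to feature sets. It assumes the video features have already been extracted (for example from I3D or VideoMAE-v2) into arrays of shape (N, D); the function name `frechet_distance` and the toy random features are illustrative and are not taken from the paper's code.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    # atleast_2d keeps the covariance a matrix even when D == 1.
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) is the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    # Clipping guards against tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Toy usage with random stand-in "features" (assumption: real features
# would come from a pretrained video network).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
fake_feats = real_feats + 3.0  # same covariance, shifted mean
distance = frechet_distance(real_feats, fake_feats)
```

Because the distance depends only on the mean and covariance of the features, any bias in what the feature extractor encodes carries over directly to the metric, which is the point the abstract makes about content-biased features.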

Paper Structure (16 sections, 2 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: FVD is biased towards per-frame quality over temporal consistency. FVD (Unterthiner et al., 2019), a commonly used video generation evaluation metric, should ideally capture both spatial and temporal aspects. However, our experiments reveal a strong bias toward individual frame quality. (a) First, we apply mild spatial distortions through local warping, which results in an FVD score of 317.10. (b) Next, we induce slightly less spatial corruption but severe temporal inconsistencies by altering each frame differently. These changes create artifacts that are noticeable to humans and evident in the spatiotemporal x-t slice, as seen in the bottom row, but surprisingly lead to a lower (better) FVD score of 310.52. This discrepancy highlights the metric's bias towards individual frame quality. We encourage readers to view the videos with Acrobat Reader or visit our website to observe the inconsistencies.
  • Figure 2: Analyzing the FVD's sensitivity to temporal consistency. We distort the same set of videos either spatially only or spatiotemporally, so that the resulting videos have similar frame quality and differ only in temporal quality. By comparing the FVD scores of the two distorted video sets, we aim to quantify the temporal sensitivity of the metric.
  • Figure 3: Visualization of the spatial and spatiotemporal corruptions. Both corruptions yield similar frame quality, while the spatiotemporal corruption induces additional temporal inconsistency in the video. By comparing the FVD of the spatiotemporal corruption with the spatial corruption, we analyze the temporal sensitivity of the metric. Best viewed with Acrobat Reader. Please check our website for videos.
  • Figure 4: FVD sensitivity with different video feature extractors. We show that by substituting the I3D features with ones computed from the VideoMAE-v2 model, the temporal sensitivity can be significantly improved for both kinds of distortions.
  • Figure 5: The origin of FVD sensitivity. We show that the temporal sensitivity achieved by using VideoMAE features is mainly attributable to the self-supervised objective.
  • ...and 6 more figures
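
The spatial versus spatiotemporal comparison described in Figures 2 and 3 can be illustrated with a toy sketch. It assumes videos are arrays of shape (T, H, W) and uses a circular horizontal shift as a stand-in distortion; this is not the local warping used in the paper. The helper names (`corrupt_spatial`, `corrupt_spatiotemporal`, `temporal_roughness`) are hypothetical.

```python
import numpy as np


def corrupt_spatial(video, shift):
    """Apply the same horizontal circular shift to every frame of a (T, H, W) video."""
    return np.roll(video, shift, axis=2)


def corrupt_spatiotemporal(video, shift, rng):
    """Shift each frame by +shift or -shift, with the sign drawn independently per frame.

    Every frame is distorted by the same magnitude as in corrupt_spatial,
    but the distortion is no longer consistent across time.
    """
    signs = rng.choice([-1, 1], size=video.shape[0])
    frames = [np.roll(frame, int(s) * shift, axis=1) for frame, s in zip(video, signs)]
    return np.stack(frames)


def temporal_roughness(video):
    """Mean absolute difference between consecutive frames (a crude flicker proxy)."""
    return float(np.abs(np.diff(video, axis=0)).mean())


# Toy usage: a perfectly static 64-frame video built from one random frame.
rng = np.random.default_rng(0)
frame = rng.normal(size=(8, 16))
video = np.tile(frame, (64, 1, 1))

spatial = corrupt_spatial(video, shift=3)
spatiotemporal = corrupt_spatiotemporal(video, shift=3, rng=rng)
```

In this toy setup both corrupted videos contain frames with identical pixel content up to a shift, so per-frame quality is matched, while only the spatiotemporal version introduces frame-to-frame jitter. A metric that is sensitive to time should score the two sets differently, which is the kind of gap the figures examine.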