
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos

Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, Renjie Liao

TL;DR

This work introduces FVMD, a motion-focused metric for evaluating generated videos. It tracks key point trajectories to extract velocity and acceleration features, then compares the distribution of these features with that of ground-truth videos via the Fréchet distance. The method is validated through sanity checks, sensitivity analyses, and large-scale human studies, and it aligns more closely with human judgments than existing metrics such as FVD, FID-VID, and VBench. Incorporating the motion features also improves unary video quality assessment (VQA) models, which suggests the approach applies beyond pairwise video evaluation. Overall, the results indicate that FVMD measures temporal motion quality more faithfully and can be used to improve video generation and evaluation pipelines.

Abstract

Video generative models have advanced significantly in recent years. Video generation is more challenging than image generation: it requires not only high-quality frames but also temporal consistency across them. Despite this progress, metrics for evaluating generated videos, especially their temporal and motion consistency, remain underexplored. To bridge this gap, we propose the Fréchet Video Motion Distance (FVMD), a metric that focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key point tracking and then measure the similarity between the features of generated and ground-truth videos via the Fréchet distance. We conduct a sensitivity analysis by injecting noise into real videos to verify the effectiveness of FVMD. Further, we carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics. Additionally, our motion features consistently improve the performance of Video Quality Assessment (VQA) models, indicating that our approach is also applicable to unary video quality evaluation. Code is available at https://github.com/ljh0v0/FMD-frechet-motion-distance.
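
To make the final step of the abstract concrete, the sketch below computes a Fréchet distance between two sets of feature vectors. It assumes, as FID-style metrics commonly do, that each feature set is summarized by a Gaussian (mean and covariance); the abstract does not spell out this detail, so treat it as an assumption. The function name `frechet_distance` and the array shapes are illustrative and are not taken from the authors' repository.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    Assumption: each set is modeled as a Gaussian, as in FID-style metrics.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Toy usage with random stand-ins for motion features.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
gen_feats = rng.normal(loc=0.5, size=(500, 4))
score = frechet_distance(real_feats, gen_feats)
```

A lower score means the two feature distributions are closer. Comparing a set with itself gives a value near zero, which is the behavior a same-dataset sanity check looks for.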

Paper Structure (18 sections, 5 equations, 6 figures, 7 tables)

This paper contains 18 sections, 5 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of the fidelity of different video evaluation metrics. Top: videos generated by various models trained on the TikTok dataset [jafarian2022self], ranked according to the human ratings in the user study. Bottom: quantitative scores and relative rankings given by our FVMD and other widely used metrics, including FVD [unterthiner2018towards], FID-VID [balaji2019conditional], and VBench [huang2023vbench]. The correlations with human scores are computed using the Pearson correlation coefficient (detailed in the human study subsection). Our FVMD achieves the best correlation with human judgment among all the metrics and clearly distinguishes video samples of different quality.
  • Figure 2: The overall pipeline of our proposed Fréchet Video Motion Distance (FVMD). Our pipeline first tracks video key point trajectories using the pre-trained PIPs++ model [zheng2023pointodyssey] and computes the velocity and acceleration fields for each frame. The motion features are then derived from the histograms of the quantized velocity and acceleration. FVMD is finally given by the Fréchet distance between the motion features of generated and ground-truth videos.
  • Figure 3: Sanity check experiments. We use dense 1D histograms based on velocity, acceleration, and their concatenated combination to construct FVMD metrics. As sample size increases, same-dataset discrepancies (BAIR vs BAIR) converge to zero, while cross-dataset discrepancies (TIKTOK vs BAIR) remain large, verifying the soundness of our FVMD metric.
  • Figure 4: Sensitivity analysis. We present the FVMD results in the presence of various temporal noises. FVMD based on combined velocity and acceleration features shows the most reliable performance in distinguishing temporal inconsistencies.
  • Figure 5: Sanity check. We visualize the curve for FVMD with quantized 2D histogram versus the number of samples.
  • ...and 1 more figure
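
The middle stages of the pipeline described in the Figure 2 caption (velocity and acceleration from tracked key points, then histogram-based features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tracks are taken as given (the PIPs++ tracker is not reproduced), and the quantization used here, a magnitude-weighted histogram over motion direction with 8 bins, is one plausible choice rather than the paper's exact scheme. All function names are hypothetical.

```python
import numpy as np

def velocity_and_acceleration(tracks):
    """tracks: (num_frames, num_points, 2) key point positions in pixels.

    Returns finite-difference velocity (num_frames - 1, num_points, 2)
    and acceleration (num_frames - 2, num_points, 2).
    """
    velocity = np.diff(tracks, axis=0)
    acceleration = np.diff(velocity, axis=0)
    return velocity, acceleration

def direction_histogram(vectors, n_bins=8):
    """Magnitude-weighted histogram over vector direction.

    Assumption: this quantization is illustrative only.
    """
    angles = np.arctan2(vectors[..., 1], vectors[..., 0])  # in [-pi, pi]
    magnitudes = np.linalg.norm(vectors, axis=-1)
    hist, _ = np.histogram(
        angles, bins=n_bins, range=(-np.pi, np.pi), weights=magnitudes
    )
    total = hist.sum()
    # Normalize so the feature does not depend on clip length;
    # an all-zero histogram (no motion) is returned unchanged.
    return hist / total if total > 0 else hist

def motion_feature(tracks, n_bins=8):
    """Concatenate velocity and acceleration histograms into one vector."""
    velocity, acceleration = velocity_and_acceleration(tracks)
    return np.concatenate([
        direction_histogram(velocity, n_bins),
        direction_histogram(acceleration, n_bins),
    ])

# Toy usage: three points drifting with constant velocity (2, 1) px/frame.
steps = np.arange(5, dtype=float).reshape(5, 1, 1) * np.array([2.0, 1.0])
toy_tracks = np.zeros((1, 3, 2)) + steps
toy_feature = motion_feature(toy_tracks)
```

One such feature vector per video clip would then be collected for the generated and the ground-truth sets, and the two collections compared with the Fréchet distance.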