Temporal Realism Evaluation of Generated Videos Using Compressed-Domain Motion Vectors
Mert Onur Cakiroglu, Idil Bilge Altun, Zhihe Lu, Mehmet Dalkilic, Hasan Kurban
TL;DR
This paper addresses the lack of temporal-realism metrics for generated videos by introducing a scalable framework that uses compressed-domain motion vectors (MVs) from codecs such as H.264 and HEVC. Motion realism is quantified through distributional divergences between real and generated MV statistics: Kullback-Leibler (KL), Jensen-Shannon (JS), and Wasserstein distance (WD). MV statistics, including magnitude sums, entropy, and class-conditional heatmaps, reveal temporal inconsistencies that frame-centric metrics miss, and MV-RGB fusion can improve downstream video classification across ResNet, I3D, and TSN backbones. The authors compare four fusion strategies (Channel Concatenation, Cross-Attention, Joint Embedding, Motion-Aware Fusion) and show that integrating MVs generally improves discriminability while providing efficient temporal priors. The approach is validated on GenVidBench, where Pika aligns most closely with real motion entropy and I3D reaches up to 99.0% accuracy on real-versus-generated discrimination, illustrating the practical value of compressed-domain motion cues for both evaluation and temporal regularization in generation pipelines.
Abstract
Temporal realism remains a central weakness of current generative video models, as most evaluation metrics prioritize spatial appearance and offer limited sensitivity to motion. We introduce a scalable, model-agnostic framework that assesses temporal behavior using motion vectors (MVs) extracted directly from compressed video streams. Codec-generated MVs from standards such as H.264 and HEVC provide lightweight, resolution-consistent descriptors of motion dynamics. We quantify realism by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between MV statistics of real and generated videos. Experiments on the GenVidBench dataset containing videos from eight state-of-the-art generators reveal systematic discrepancies from real motion: entropy-based divergences rank Pika and SVD as closest to real videos, MV-sum statistics favor VC2 and Text2Video-Zero, and CogVideo shows the largest deviations across both measures. Visualizations of MV fields and class-conditional motion heatmaps further reveal center bias, sparse and piecewise constant flows, and grid-like artifacts that frame-level metrics do not capture. Beyond evaluation, we investigate MV-RGB fusion through channel concatenation, cross-attention, joint embedding, and a motion-aware fusion module. Incorporating MVs improves downstream classification across ResNet, I3D, and TSN backbones, with ResNet-18 and ResNet-34 reaching up to 97.4% accuracy and I3D achieving 99.0% accuracy on real-versus-generated discrimination. These findings demonstrate that compressed-domain MVs provide an effective temporal signal for diagnosing motion defects in generative videos and for strengthening temporal reasoning in discriminative models. The implementation is available at: https://github.com/KurbanIntelligenceLab/Motion-Vector-Learning
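To make the divergence-based comparison concrete, the following is a minimal sketch of how KL, JS, and a one-dimensional Wasserstein distance could be computed between two histograms of an MV statistic, such as per-frame MV magnitude. It is not the authors' implementation. The function names, the smoothing constant `eps`, the shared equal-width binning, and the example bin counts are all illustrative assumptions.

```python
import math
from itertools import accumulate


def normalize(hist, eps=1e-12):
    """Convert non-negative bin counts to probabilities.

    eps is an assumed smoothing constant that keeps log() finite for empty bins.
    """
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def shannon_entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q has no zero entries (see normalize)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(p, q, bin_width=1.0):
    """1-D Wasserstein-1 distance for histograms on shared equal-width bins.

    In one dimension this equals the area between the two CDFs.
    """
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical bin counts of MV magnitudes (illustrative values, not from the paper).
real_hist = [40, 30, 15, 10, 5]
generated_hist = [70, 20, 5, 3, 2]

p = normalize(real_hist)
q = normalize(generated_hist)

print("entropy (real, generated):", shannon_entropy(p), shannon_entropy(q))
print("KL(real || generated):", kl_divergence(p, q))
print("JS:", js_divergence(p, q))
print("Wasserstein-1:", wasserstein_1d(p, q))
```

KL is asymmetric and sensitive to bins that are empty in one distribution, JS is symmetric and bounded, and the Wasserstein distance accounts for how far probability mass must move along the magnitude axis. This gives one plausible reason why rankings of generators can differ across measures.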
