Temporal Realism Evaluation of Generated Videos Using Compressed-Domain Motion Vectors

Mert Onur Cakiroglu, Idil Bilge Altun, Zhihe Lu, Mehmet Dalkilic, Hasan Kurban

TL;DR

This paper tackles the gap in assessing temporal realism of generative videos by introducing a scalable framework that leverages compressed-domain motion vectors (MVs) from codecs such as H.264 and HEVC to quantify motion realism via three distributional divergences: Kullback-Leibler (KL), Jensen-Shannon (JS), and Wasserstein (WD). It demonstrates that MV statistics, including magnitude sums, entropy, and class-conditional heatmaps, reveal temporal inconsistencies not captured by frame-centric metrics, and that MV-RGB fusion can enhance downstream video classification across ResNet, I3D, and TSN backbones. The authors propose and compare multiple fusion strategies (Channel Concatenation, Cross-Attention, Joint Embedding, Motion-Aware Fusion) and show that integrating MVs generally improves discriminability while providing efficient temporal priors. The approach is validated on GenVidBench, where Pika attains the closest alignment to real motion entropy and I3D achieves near-perfect real-versus-generated discrimination (up to 99.0% accuracy), illustrating the practical impact of compressed-domain motion cues for both evaluation and temporal regularization in generation pipelines.

Abstract

Temporal realism remains a central weakness of current generative video models, as most evaluation metrics prioritize spatial appearance and offer limited sensitivity to motion. We introduce a scalable, model-agnostic framework that assesses temporal behavior using motion vectors (MVs) extracted directly from compressed video streams. Codec-generated MVs from standards such as H.264 and HEVC provide lightweight, resolution-consistent descriptors of motion dynamics. We quantify realism by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between MV statistics of real and generated videos. Experiments on the GenVidBench dataset containing videos from eight state-of-the-art generators reveal systematic discrepancies from real motion: entropy-based divergences rank Pika and SVD as closest to real videos, MV-sum statistics favor VC2 and Text2Video-Zero, and CogVideo shows the largest deviations across both measures. Visualizations of MV fields and class-conditional motion heatmaps further reveal center bias, sparse and piecewise constant flows, and grid-like artifacts that frame-level metrics do not capture. Beyond evaluation, we investigate MV-RGB fusion through channel concatenation, cross-attention, joint embedding, and a motion-aware fusion module. Incorporating MVs improves downstream classification across ResNet, I3D, and TSN backbones, with ResNet-18 and ResNet-34 reaching up to 97.4% accuracy and I3D achieving 99.0% accuracy on real-versus-generated discrimination. These findings demonstrate that compressed-domain MVs provide an effective temporal signal for diagnosing motion defects in generative videos and for strengthening temporal reasoning in discriminative models. The implementation is available at: https://github.com/KurbanIntelligenceLab/Motion-Vector-Learning
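
The divergence step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each video set has already been reduced to a list of scalar MV statistics (for example per-frame MV sums), and the helper names, the bin count, the additive smoothing constant `eps`, and the equal-sample-size restriction in `wasserstein_1d` are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def histogram(samples: Sequence[float], bins: int, lo: float, hi: float,
              eps: float = 1e-9) -> List[float]:
    """Normalized histogram over [lo, hi]; eps-smoothing keeps KL finite."""
    counts = [0.0] * bins
    width = (hi - lo) / bins
    for x in samples:
        idx = int((x - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1.0  # clamp out-of-range values
    total = sum(counts) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size empirical samples.

    For equal sample sizes this is the mean absolute difference of the
    sorted values (assumption: both sets were subsampled to the same size).
    """
    if len(a) != len(b):
        raise ValueError("this sketch requires equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical per-frame MV sums for a real and a generated set.
real_stats = [0.8, 1.1, 1.3, 0.9, 1.0, 1.2]
gen_stats = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3]

p_real = histogram(real_stats, bins=8, lo=0.0, hi=2.0)
p_gen = histogram(gen_stats, bins=8, lo=0.0, hi=2.0)

print("KL :", kl_divergence(p_real, p_gen))
print("JS :", js_divergence(p_real, p_gen))
print("WD :", wasserstein_1d(real_stats, gen_stats))
```

KL is asymmetric and sensitive to empty bins, which is why the sketch smooths the histograms; JS is symmetric and bounded, and the Wasserstein distance works directly on the samples without binning.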

Paper Structure

This paper contains 27 sections, 39 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the proposed compressed-domain motion framework. (I) Compressed-domain MVs are obtained from standard video codecs using block matching on macroblocks, for example $16{\times}16$. Frame A illustrates a macroblock, and Frame B shows the search process and the selected best-matching block, which forms the MV between frames. (II) MV fields are spatially aligned with RGB frames to produce consistent motion representations. We consider multi-modal fusion paradigms, including channel concatenation, joint embedding, cross-attention, and motion-aware fusion. In motion-aware fusion, motion masks modulate the contribution of motion features to appearance features, emphasizing informative motion while suppressing noise.
  • Figure 2: (TOP) Motion statistics of real and generated videos, shown in two views. In the temporal view, videos are temporally interpolated to a common length, after which per-frame averages are computed to visualize how MV sums and motion entropy evolve over time. This view emphasizes temporal dynamics and highlights deviations in frame-to-frame consistency. In the distributional view, temporal ordering is disregarded and all frames across the dataset are aggregated as independent samples, producing overall distributions of motion vector magnitudes and motion entropy. This representation captures the global statistical alignment of motion features while ignoring sequential context. (MIDDLE) Representative flow fields per class. Arrows indicate the direction of $V_{t^\star}(u,v)$ and their lengths encode relative magnitude after panel-wise normalization; fields are resized to a common grid and downsampled to a coarse lattice for clarity. More information is given in Appendix section \ref{subsec:flowfields}. (BOTTOM) Class-conditional motion-magnitude heatmaps. Intensity encodes the expected per-pixel $\ell_2$ magnitude $\bar{h}_c(u,v)$, averaged over frames and clips on a $56\times56$ grid; maps are min–max rescaled to $[0,1]$ for visualization. More information is given in Appendix section \ref{subsec:heatmap}.
  • Figure 3: Normalized divergences of motion statistics for all models relative to real videos. Panel (a) reports divergences computed from motion vector magnitude metrics and panel (b) shows divergences based on motion entropy metrics. For each model, values are normalized per metric to highlight relative discrepancies across motion characteristics. Lower values indicate closer alignment with real video dynamics, while higher values reflect stronger deviations in temporal realism. Pika aligns best with real motion entropy; VC2 and Text2Video-Zero (T2VZ) achieve the lowest divergences for motion-vector sums. Full results are in Appendix \ref{subsec:distmet}.
  • Figure 4: Regional motion-energy profiles on a $4\times4$ grid. Each cell encodes the expected per-region magnitude $\bar{s}_c(r,c)$ (and/or its normalized share $p_c(r,c)$), averaged over frames and clips within each class; higher values indicate regions where motion energy concentrates.
  • Figure 5: Class-conditional distributions of clip-level motion descriptors. (a) Mean motion magnitude $M_i$ per clip, displayed on a log scale. (b) Motion entropy $H_i$ per clip, averaging per-frame Shannon entropy of the magnitude field. Together, these summarize the strength and spatial complexity of motion across classes.
  • ...and 1 more figure
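
The clip-level descriptors named in the captions for Figures 4 and 5 (motion entropy and regional motion energy on a $4\times4$ grid) can be illustrated with a short sketch. The captions do not give exact definitions, so everything here is an assumption for illustration: the function names are hypothetical, entropy is taken over a 16-bin histogram of per-block magnitudes scaled by the frame maximum, and the regional profile uses per-region sums for a single frame rather than the class-level expectation $\bar{s}_c(r,c)$.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]
Field = Sequence[Sequence[Vector]]  # rows of (dx, dy) motion vectors


def magnitude_field(mv_field: Field) -> List[List[float]]:
    """L2 magnitude of every motion vector in one frame."""
    return [[math.hypot(dx, dy) for dx, dy in row] for row in mv_field]


def shannon_entropy(mag: Sequence[Sequence[float]], bins: int = 16) -> float:
    """Shannon entropy (nats) of the magnitude histogram of one frame."""
    values = [m for row in mag for m in row]
    top = max(values)
    if top == 0.0:
        return 0.0  # static frame: all mass falls in a single bin
    counts = [0] * bins
    for m in values:
        counts[min(int(m / top * bins), bins - 1)] += 1
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts if c)


def clip_entropy(frames: Sequence[Field], bins: int = 16) -> float:
    """Average of the per-frame entropies over a clip."""
    return sum(shannon_entropy(magnitude_field(f), bins) for f in frames) / len(frames)


def regional_profile(mag: Sequence[Sequence[float]], grid: int = 4):
    """Per-region magnitude sums on a grid x grid layout, plus normalized shares."""
    h, w = len(mag), len(mag[0])
    sums = [[0.0] * grid for _ in range(grid)]
    for v in range(h):
        for u in range(w):
            sums[v * grid // h][u * grid // w] += mag[v][u]
    total = sum(sum(row) for row in sums)
    shares = [[s / total if total else 0.0 for s in row] for row in sums]
    return sums, shares


# Hypothetical 8x8 MV field with motion only in the top-left block.
field = [[(0.0, 0.0)] * 8 for _ in range(8)]
field[0][0] = (3.0, 4.0)

mag = magnitude_field(field)
sums, shares = regional_profile(mag)
print("frame entropy:", shannon_entropy(mag))
print("top-left share:", shares[0][0])
```

Averaging `sums` or `shares` over all frames and clips of a class would give a class-level profile comparable in spirit to the grid in Figure 4, under the assumptions stated above.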