DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis

Nithin C. Babu, Aniruddha Mahapatra, Harsh Rangwani, Rajiv Soundararajan, Kuldeep Kulkarni

TL;DR

DynamicEval addresses the gap in evaluating T2V models under dynamic camera motion by introducing a prompt suite and a large-scale, video-level human-annotated dataset. It presents two fine-grained metrics, MS-Debias for background consistency and Track-FG for foreground object consistency, to achieve stronger alignment with human preferences than prior baselines. The approach demonstrates improved correlation with human judgments at both video and model levels and provides scalable, interpretable, pixel-level evaluation signals for dynamic T2V generation. Overall, DynamicEval offers a comprehensive framework and dataset for advancing dynamic T2V evaluation and guiding development of better-quality video generation under motion.

Abstract

Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) They emphasize subject-centric prompts or static camera scenes, so camera motion, which is essential for producing cinematic shots, and the behavior of existing metrics under dynamic motion remain largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlooks video-level evaluation, which is vital for selecting the better video among the candidates generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain interpretable error maps based on the VBench motion smoothness metric. We observe that while the VBench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera movement and from foreground object movement. Building on this, we propose a new background consistency metric that leverages object error maps to correct these two failure cases in a principled manner. Our second innovation is a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2 percentage points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.

Paper Structure

This paper contains 43 sections, 3 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Prompt curation: Scene elements from databases (orange) are sampled into metadata (JSON format), which GPT-4o converts into descriptive prompts. Dataset: Video pairs generated from a common prompt are annotated via a subjective study.
  • Figure 2: Dataset analysis: (a) shows the average win rates for both evaluation dimensions. (b) illustrates the percentage of video samples in each model that are static and dynamic.
  • Figure 3: Motion Smoothness error maps: The zoomed-in regions show localized distortions visible across frames. VB-MS shows large errors near edges and foreground objects, suppressing the localized distortions. After debiasing, the localized distortions are visible.
  • Figure 4: MS-Debias obtains debiased motion smoothness error maps by masking out foreground objects and occlusions. It is applied at multiple scales with Gaussian pyramid downsampling.
  • Figure 5: Deviation of neighbor tracks on a real video vs. a generated video with object distortions. Blue plot: distance between a nearest neighbor and a candidate point. Red plot: moving average of blue. Green plot: deviation between blue and red.
  • ...and 6 more figures
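
The three curves described in the Figure 5 caption can be sketched as follows. This is a rough illustration, not the paper's Track-FG implementation: it assumes point tracks are already available as per-frame pixel coordinates, and the function name `neighbor_deviation`, the window size, the edge padding, and the use of an absolute difference for the deviation are all assumptions. How the deviations are aggregated into a final score is not stated in this section and is not shown.

```python
import numpy as np

def neighbor_deviation(track, neighbor, window=5):
    """track, neighbor: (T, 2) arrays of pixel coordinates over T frames.
    window: odd moving-average length (an assumed value).
    Returns the per-frame deviation between the neighbor distance
    and its moving average."""
    # Blue curve: distance between the candidate point and its neighbor.
    dist = np.linalg.norm(track - neighbor, axis=1)
    # Red curve: moving average, edge-padded so the output keeps length T.
    pad = window // 2
    padded = np.pad(dist, pad, mode="edge")
    kernel = np.ones(window) / window
    smooth = np.convolve(padded, kernel, mode="valid")
    # Green curve: deviation between the raw distance and its average.
    return np.abs(dist - smooth)

# Hypothetical tracks: a point moving linearly for 20 frames.
t = np.arange(20, dtype=float)
track = np.stack([t, 2 * t], axis=1)

# A neighbor that moves rigidly with the point keeps a constant distance.
rigid = track + np.array([3.0, 4.0])

# A distorted neighbor jumps away from the point in frame 10.
jitter = np.zeros((20, 2))
jitter[10] = [6.0, 0.0]
distorted = rigid + jitter

print(neighbor_deviation(track, rigid).max())
print(neighbor_deviation(track, distorted).max())
```

Under these assumptions, a rigidly moving neighbor yields a near-zero deviation, while a sudden local distortion produces a spike at the affected frame, matching the qualitative contrast the caption draws between real and generated videos.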