VMBench: A Benchmark for Perception-Aligned Video Motion Generation

Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu

TL;DR

VMBench tackles the misalignment between current video motion evaluation and human perception by introducing perception-aligned motion metrics (PMM), a scalable Meta-Guided Motion Prompt Generation (MMPG) pipeline, and a human-aligned validation framework. The resulting benchmark spans six motion domains with 969 categories, supported by 1,050 prompts and human preference annotations, and its metrics achieve an average 35.3% improvement in Spearman correlation with human judgments over baseline methods. By combining fine-grained metrics (CAS, MSS, OIS, PAS, TCS) with diverse, physics-informed prompts, VMBench provides per-dimension diagnostics for identifying and improving weaknesses in the motion quality of video generation models. The open-source release (prompts, evaluation methods, generated videos, and annotations) aims to standardize motion evaluation and accelerate progress toward perceptually realistic, dynamically consistent video generation.

Abstract

Video generation has advanced rapidly, and evaluation methods have improved with it, yet assessing the motion in generated videos remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perception; 2) existing motion prompts are limited in diversity. To address these issues, we introduce VMBench, a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: 1) Perception-Driven Motion Evaluation Metrics: we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. 2) Meta-Guided Motion Prompt Generation: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. 3) Human-Aligned Validation Mechanism: we provide human preference annotations to validate our benchmark, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench at https://github.com/GD-AIGC/VMBench, setting a new standard for evaluating and advancing motion generation models.
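
The abstract reports alignment with human perception in terms of Spearman's correlation. As a minimal sketch of what that statistic measures, the following ranks two score lists and correlates the ranks. The function names and all numbers are hypothetical illustrations, not values or code from VMBench, and the abstract does not say how the 35.3% average improvement is aggregated, so that step is not reproduced here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-video scores (illustration only, not VMBench data).
human = [4.5, 2.0, 3.5, 1.0, 5.0]          # mean human preference ratings
metric = [0.81, 0.35, 0.40, 0.20, 0.90]    # a perception-aligned metric
baseline = [0.50, 0.60, 0.30, 0.45, 0.55]  # a baseline metric

rho_metric = spearman(human, metric)
rho_baseline = spearman(human, baseline)
print(f"metric rho = {rho_metric:.2f}, baseline rho = {rho_baseline:.2f}")
```

Because only ranks enter the computation, a metric can score well without being on the same scale as the human ratings; it only has to order the videos the way annotators do.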

Paper Structure

This paper contains 27 sections, 10 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Overview of VMBench. Our benchmark encompasses six principal categories of motion patterns, with each prompt constructed as a comprehensive motion description structured around three core components: subject, place, and action. We propose a novel multi-dimensional video motion evaluation comprising five human-centric quality metrics derived from perceptual preferences. Utilizing videos generated by popular T2V models, we conduct systematic human evaluations to validate the effectiveness of our metrics in capturing human perceptual preferences.
  • Figure 2: Framework of our Perception-Driven Motion Metrics (PMM). PMM comprises multiple evaluation metrics: Commonsense Adherence Score (CAS), Motion Smoothness Score (MSS), Object Integrity Score (OIS), Perceptible Amplitude Score (PAS), and Temporal Coherence Score (TCS). (a-e): Computational flowcharts for each metric. The scores produced by PMM show variation trends consistent with human assessments, indicating strong alignment with human perception.
  • Figure 3: Framework of our Meta-Guided Motion Prompt Generation (MMPG). MMPG consists of three stages: (a) Meta-information Extraction: Extracting Subjects, Places, and Actions from datasets such as VidProM [wang2024vidprom], DiDeMo [anne2017localizing], MSR-VTT [xu2016msr], WebVid [bain2021frozen], Places365 [zhou2017places], and Kinetics-700 [carreira2019short]. (b) Self-Refining Prompt Generation: Generating and iteratively refining prompts based on the extracted information. (c) Human-LLM Joint Validation: Validating the prompts through a collaborative process between humans and DeepSeek-R1 to ensure their rationality.
  • Figure 4: Correlation Matrix Analysis of Metrics Within Different Evaluation Mechanisms. (a): Spearman Correlation Matrices for human annotations; (b): Spearman Correlation Matrices for our PMM metrics.
  • Figure 5: Our metrics framework for evaluating video motion, which is inspired by the mechanisms of human perception of motion in videos. (a) Human perception of motion in videos primarily encompasses two dimensions: Comprehensive Analysis of Motion and Capture of Motion Details. (b) Our proposed metrics framework for evaluating video motion. Specifically, the MSS and CAS correspond to the human process of Comprehensive Analysis of Motion, while the OIS, PAS, and TCS correspond to the capture of motion details.
  • ...and 11 more figures
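
The captions for Figures 1 and 3 describe each prompt as built from three components: subject, place, and action. The sketch below illustrates only that three-part structure by enumerating combinations with a fixed template. It is not the authors' pipeline, which per the Figure 3 caption generates and refines prompts with an LLM and then validates them jointly with humans; the `MetaInfo` class, the template, and the example vocabulary are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MetaInfo:
    """One (subject, place, action) triple; a hypothetical container."""
    subject: str
    place: str
    action: str


def compose_prompt(meta: MetaInfo) -> str:
    """Fill a fixed template; a real pipeline would use an LLM here instead."""
    return f"{meta.subject} {meta.action} {meta.place}."


def candidate_prompts(subjects, places, actions):
    """Enumerate every subject/place/action combination as a candidate prompt."""
    return [
        compose_prompt(MetaInfo(s, p, a))
        for s, p, a in product(subjects, places, actions)
    ]


# Hypothetical vocabulary (illustration only, not taken from the datasets).
subjects = ["A golden retriever", "A cyclist"]
places = ["on a beach", "in a city park"]
actions = ["runs", "jumps over a puddle"]

prompts = candidate_prompts(subjects, places, actions)
for prompt in prompts:
    print(prompt)
```

Exhaustive enumeration like this produces many implausible combinations, which is one reason a refinement and validation stage is needed before candidates become benchmark prompts.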