McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

Qiushi Yang, Yingjie Chen, Yuan Yao, Yifang Men, Huaizhuo Liu, Miaomiao Cui

TL;DR

The paper tackles the challenge of aligning text-to-video generation with human preferences, which are inherently multi-dimensional and subjective. It introduces McSc, a three-stage reinforcement learning framework comprising ScDR for per-dimension reasoning, HCR for holistic comparison, and McDPO for motion-aware preference optimization. A self-critic reward model and hierarchical reasoning are trained to mimic human decision logic, while a motion-corrective weighting scheme mitigates bias towards low-motion content during alignment. Empirical results show state-of-the-art preference alignment and higher-motion video outputs across benchmarks, demonstrating the method's effectiveness and potential impact on practical T2V systems.

Abstract

Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or use proxy metrics to predict preference, and thus lack an understanding of human preference logic. Moreover, they usually align T2V models directly with the overall preference distribution, ignoring potentially conflicting dimensions such as motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. First, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Second, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structured multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models while dynamically re-weighting the alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high motion dynamics.
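
The abstract does not give the McDPO objective, so the following is only a minimal sketch of the general idea of re-weighting a DPO loss by motion, not the authors' formulation. It uses the standard pairwise DPO loss and multiplies each pair by a weight that grows when the preferred video has more motion than the rejected one. The weight function `motion_weight`, the strength `alpha`, the scalar motion scores, and the dictionary keys are all hypothetical, and the log-probabilities stand in for whatever likelihood surrogate a given T2V model exposes.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_* come from the model being trained, ref_logp_* from the frozen
    reference model; `w` marks the preferred sample, `l` the rejected one.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def motion_weight(motion_w, motion_l, alpha=0.5):
    """Hypothetical weight in (1 - alpha, 1 + alpha), not taken from the paper.

    Pairs whose preferred video moves more than the rejected one are
    up-weighted; pairs that would pull the model toward low motion are
    down-weighted. Requires 0 <= alpha < 1 so weights stay positive.
    """
    return 1.0 + alpha * math.tanh(motion_w - motion_l)


def motion_weighted_dpo_loss(pairs, beta=0.1, alpha=0.5):
    """Weighted mean of per-pair DPO losses (illustrative only)."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    total, weight_sum = 0.0, 0.0
    for p in pairs:
        w = motion_weight(p["motion_w"], p["motion_l"], alpha)
        loss = dpo_pair_loss(
            p["logp_w"], p["logp_l"], p["ref_logp_w"], p["ref_logp_l"], beta
        )
        total += w * loss
        weight_sum += w
    return total / weight_sum


if __name__ == "__main__":
    # Toy numbers, invented purely to exercise the functions.
    pairs = [
        {"logp_w": -1.0, "logp_l": -2.0, "ref_logp_w": -1.5,
         "ref_logp_l": -1.5, "motion_w": 0.8, "motion_l": 0.2},
        {"logp_w": -1.2, "logp_l": -1.1, "ref_logp_w": -1.0,
         "ref_logp_l": -1.0, "motion_w": 0.1, "motion_l": 0.9},
    ]
    print(motion_weighted_dpo_loss(pairs))
```

Normalizing by the sum of weights keeps the loss scale comparable to unweighted DPO, so `alpha` changes the relative influence of pairs rather than the effective learning rate. This is a design choice of the sketch, not something stated in the source.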

Paper Structure

This paper contains 13 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Given two videos where video 2 is preferred by humans, (a) and (b) show score-based and VLM-based reward approaches failing to make reliable judgments. (c) In contrast, our self-critic hierarchical reasoning correctly identifies the preference via single-dimension then holistic reasoning. (d) Dimension correlations reveal a strong negative coupling between motion dynamic and visual quality, indicating a potential source of reward bias.
  • Figure 2: Video generation results of the text-to-video model after alignment with the proposed method.
  • Figure 3: Illustration of the McSc framework, which covers preference prediction and preference alignment across three stages. In preference prediction, ScDR, the first stage, performs single-dimension preference judgment, and HCR, the second stage, assesses overall video quality through multi-dimensional analysis. For preference alignment, McDPO mitigates bias from negatively coupled dimensions for motion-enhanced and reliable alignment.
  • Figure 4: Visual comparison of videos generated by the baselines and the proposed method. Our McSc generates videos with larger motion dynamic and stronger semantic alignment.
  • Figure 5: Human evaluation of the preference rate of our model compared with SFT and VideoDPO on VBench across four aspects: instruction-following (IF), motion dynamic (MD), visual quality (VQ), and overall (OV) performance.
  • ...and 1 more figure