GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia

TL;DR

The paper tackles the challenge of evaluating text-to-video generation beyond traditional perceptual metrics by introducing GRADEO, a human-like evaluator built on multi-step reasoning. It also presents GRADEO-Instruct, a 3.3k-video, 16k-annotation dataset, and trains the evaluator via instruction tuning with LoRA on Qwen2-VL-7B-Instruct, so that each score across seven dimensions comes with an explanation and rationale. Empirical results show closer alignment with human judgments than baselines achieve, and benchmarking across recent T2V models reveals that current systems struggle with real-world alignment, safety, and narrative coherence. The work provides a scalable, interpretable framework and dataset to drive safer, more realistic, and better-validated text-to-video generation research and applications.

Abstract

Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted from 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios.
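The abstract claims better alignment with human evaluations but does not say here how alignment is measured. One common way to quantify agreement between an automatic evaluator and human raters is Spearman rank correlation. The sketch below is an illustration under that assumption, not the authors' evaluation code; the function names and the example scores are hypothetical.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(human: Sequence[float], model: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(human) != len(model):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(human), average_ranks(model))


# Hypothetical scores for five videos on one dimension (not from the paper).
human_scores = [1, 2, 3, 4, 5]
model_scores = [1, 3, 2, 4, 5]
rho = spearman(human_scores, model_scores)
```

A value of `rho` near 1 means the evaluator orders videos much as the human raters do. Rank correlation ignores differences in scale, so it would typically be reported alongside an absolute-error measure.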

Paper Structure

This paper contains 52 sections, 3 equations, 19 figures, 10 tables.

Figures (19)

  • Figure 1: Traditional evaluation methods, limited by small datasets and model parameters, suffer from three key issues: (1) inability to accurately understand video content, (2) lack of explainability with only score outputs, and (3) a focus on low-level features like video quality, neglecting high-level aspects such as rationality, safety and creativity. We propose GRADEO, a novel approach that leverages human-like reasoning for comprehensive video evaluation, enabling accurate and interpretable assessments.
  • Figure 2: An overview of GRADEO. (a) Dataset Construction Pipeline. First, we source (prompt, video) data and collect human annotations. Then, we convert them to instruction tuning datasets. (b) Evaluation Process Pipeline. GRADEO generates an assessment score after multi-step reasoning.
  • Figure 3: Human score distribution for datasets across dimensions in GRADEO-Instruct.
  • Figure 4: Qualitative Results for Creativity Dimension. MLLM baselines may refuse to provide a score or give a score before presenting the evaluation reasoning. In contrast, our approach generates detailed reasoning steps prior to assigning a final score, resulting in higher consistency with human assessments.
  • Figure 5: Qualitative Results for Rationality Dimension. MLLM baselines either fail to assess the rationality of the video accurately, or are misled by the prompt and generate hallucinations. Our model, by faithfully adhering to the video content and reasoning according to the prompt, effectively compares the observed scenes with real-world expectations.
  • ...and 14 more figures