VideoGen-Eval: Agent-based System for Video Generation Evaluation

Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha

TL;DR

VideoGen-Eval addresses the inadequacy of existing video-generation evaluation by introducing an agent-based, dynamic framework that combines LLM-driven content structuring, multimodal LLM (MLLM) judgment, and patch tools for assessing temporally dense video attributes. The authors also provide a large-scale benchmark with 700 structured prompts and over 12,000 videos generated by more than 20 models, plus human annotations for validating alignment. In the reported experiments, the agent-based system aligns more closely with human preferences than static, operator-based benchmarks, and ablations show the value of structured prompts and temporal tools. The framework offers a scalable, extensible path toward human-aligned evaluation as video-generation models continue to evolve, with potential for domain-specific patch tools and post-training improvements.

Abstract

The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase a model's capabilities, fixed evaluation operators that struggle with out-of-distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an agent-based evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporally dense dimensions, to achieve dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models, of which 8 cutting-edge models are selected for quantitative evaluation by both the agent and human annotators. Extensive experiments validate that our proposed agent-based evaluation system demonstrates strong alignment with human preferences and reliably completes the evaluation, and confirm the diversity and richness of the benchmark.

Paper Structure

This paper contains 22 sections, 2 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: VideoGen-Eval. Our benchmark includes structured prompts with rich content, large-scale results generated by multiple cutting-edge models, and human annotations. We also propose an agent-based dynamic evaluation system that can reliably complete the evaluation and adapt to human preferences.
  • Figure 2: The existing benchmarks adopt unreasonable evaluation operators, such as assigning high scores to videos with noticeable flickering while penalizing slight camera movements with disproportionately low semantic consistency scores.
  • Figure 3: Statistics of the word cloud in the collected prompts.
  • Figure 4: Pipeline overview. The agent-based evaluation system is mainly composed of three parts: an LLM-based content structurer, an MLLM-based content judger, and patch tools. The content structurer parses the input prompt into dimension-specific content and sends it, along with the generated video, to the MLLM-based content judger. Leveraging the MLLM's fundamental objective understanding capabilities and externally invoked temporally dense tools, the system assesses whether each dimension of the input is accurately generated. The resulting scores and feedback are used for ranking and evaluation, and can potentially support post-training.
  • Figure 5: Comparisons among VBench operators, our agent system, and human rankings on several evaluation dimensions.
  • ...and 9 more figures
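
The pipeline described in the Figure 4 caption (structurer, then judger with patch tools, then ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`structure_prompt`, `judge`, `PATCH_TOOLS`, `rank_models`) is hypothetical, the LLM and MLLM calls are replaced by toy string rules, a "video" is stood in for by a list of per-frame text descriptions, and the binary per-dimension score is an assumption made only to keep the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: all names here are hypothetical stand-ins, not the paper's API.


@dataclass
class Judgment:
    dimension: str
    score: float   # assumed binary: 1.0 if the dimension is judged correct, else 0.0
    feedback: str


def structure_prompt(prompt: str) -> Dict[str, str]:
    """Stand-in for the LLM-based content structurer.

    Toy rule: the prompt is already written as 'dimension: content; ...'.
    A real system would ask an LLM to produce this structure.
    """
    parts: Dict[str, str] = {}
    for chunk in prompt.split(";"):
        if ":" in chunk:
            name, content = chunk.split(":", 1)
            parts[name.strip().lower()] = content.strip()
    return parts


def motion_tool(frames: List[str]) -> str:
    """Toy temporally dense patch tool: counts frame-to-frame changes."""
    changes = sum(1 for a, b in zip(frames, frames[1:]) if a != b)
    return f"{changes} frame-to-frame changes"


# Patch tools are looked up by dimension name; most dimensions have none.
PATCH_TOOLS: Dict[str, Callable[[List[str]], str]] = {"motion": motion_tool}


def judge(dimension: str, content: str, frames: List[str]) -> Judgment:
    """Stand-in for the MLLM-based content judger.

    Toy rule: the dimension passes if its content appears in any frame
    description. A real system would query an MLLM with the video.
    """
    tool = PATCH_TOOLS.get(dimension)
    tool_report = tool(frames) if tool else "no tool invoked"
    hit = any(content.lower() in frame.lower() for frame in frames)
    verdict = "found" if hit else "missing"
    return Judgment(dimension, 1.0 if hit else 0.0,
                    f"'{content}' {verdict}; {tool_report}")


def evaluate(prompt: str, frames: List[str]) -> List[Judgment]:
    """Run structurer, then judge each dimension against the video."""
    return [judge(dim, content, frames)
            for dim, content in structure_prompt(prompt).items()]


def rank_models(results: Dict[str, List[Judgment]]) -> List[str]:
    """Order models by mean score over dimensions, best first."""
    def mean_score(name: str) -> float:
        judgments = results[name]
        return sum(j.score for j in judgments) / len(judgments) if judgments else 0.0
    return sorted(results, key=mean_score, reverse=True)


if __name__ == "__main__":
    prompt = "subject: cat; motion: jumps"
    videos = {
        "model_a": ["cat sits", "cat jumps", "cat lands"],
        "model_b": ["dog sits", "dog sits"],
    }
    results = {name: evaluate(prompt, frames) for name, frames in videos.items()}
    for name in rank_models(results):
        print(name, [(j.dimension, j.score, j.feedback) for j in results[name]])
```

Keeping the per-dimension feedback string alongside the score mirrors the caption's point that both scores and feedback flow out of the judger; how the actual system aggregates them for ranking is not specified in this section.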