VideoGen-Eval: Agent-based System for Video Generation Evaluation
Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha
TL;DR
VideoGen-Eval addresses the inadequacy of existing video-generation evaluation by introducing an agent-based, dynamic framework that combines LLM-driven content structuring, multimodal LLM-based judgment, and patch tools for assessing temporally dense video attributes. The authors also provide a large-scale benchmark with 700 structured prompts and over 12,000 videos generated by more than 20 models, plus human annotations for alignment validation. In the reported experiments, the agent-based system aligns more closely with human preferences than static, operator-based benchmarks, and ablations show the value of structured prompts and temporal tools. The framework offers a scalable, extensible path toward robust, human-aligned evaluation as video-generation models continue to evolve, with potential for domain-specific patch tools and post-training improvements.
Abstract
The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily because of simple prompts that cannot showcase a model's capabilities, fixed evaluation operators that struggle with out-of-distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge this gap, we propose VideoGen-Eval, an agent-based evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporally dense dimensions, to achieve dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and to verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models, of which 8 cutting-edge models are selected for quantitative evaluation by both the agent and human annotators. Extensive experiments validate that the proposed agent-based evaluation system aligns strongly with human preferences and reliably completes the evaluation, and they confirm the diversity and richness of the benchmark.
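
The abstract does not state how alignment between agent and human judgments is quantified. One common choice for this kind of comparison is a rank correlation, so the following is a minimal sketch, assuming Spearman's coefficient computed over per-video scores. The function names and the example scores are hypothetical placeholders, not values or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical placeholder scores for four videos (not data from the paper).
    agent_scores = [4.1, 3.2, 4.8, 2.5]
    human_scores = [4.0, 3.5, 4.6, 3.8]
    print(f"Spearman correlation: {spearman(agent_scores, human_scores):.2f}")
```

A value near 1 would indicate that the agent ranks videos in nearly the same order as human annotators. Other agreement measures, such as pairwise preference accuracy, would fit the same setup.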
