UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang

TL;DR

This work investigates using multimodal large language models (MLLMs) as unified evaluators for AI-generated videos (AIGVs). It introduces UVE-Bench, a benchmark with pairwise human preference annotations across 15 evaluation aspects, which supports evaluating both single-video ratings and video pair comparisons. Experiments on 18 MLLMs show that advanced MLLMs surpass existing specialized evaluators but still lag behind human evaluators. The study also analyzes key design choices for MLLM-driven evaluators, such as the prompting strategy and the number of input video frames, and outlines directions for future research on AIGV evaluation.

Abstract

With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 18 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation.

Paper Structure

This paper contains 58 sections, 9 equations, 9 figures, 22 tables.

Figures (9)

  • Figure 1: Illustration of MLLM-based unified evaluator and specialized evaluators.
  • Figure 2: Overview of UVE-Bench. (a) The distribution of video sources. (b) The distribution of data examples over the 15 fine-grained AIGV evaluation aspects. (c) The distribution of human preferences over the four categories. (d) Data examples illustrating how to evaluate both single video rating and video pair comparison using the human preference annotations. More examples can be found in the appendix (case study).
  • Figure 3: Results of different prompting strategies.
  • Figure 5: Results with varying numbers of video frames.
  • Figure 7: Functions $f_{c}(\mathcal{S}|\beta), f'_{c}(\mathcal{S}|\alpha)$ used in the evaluation criteria of single video rating.
  • ...and 4 more figures