VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao

TL;DR

VF-Eval introduces a four-task benchmark to evaluate how multimodal LLMs (MLLMs) provide feedback on AI-generated videos, covering coherence validation, error awareness, error type detection, and reasoning. Testing 13 frontier models, including GPT-4.1, the study shows that current MLLMs struggle with AIGC-specific content, and that aligning MLLM feedback more closely with human feedback (the RePrompt experiment) can improve video generation. The dataset, built from videos generated by a mix of proprietary and open-source T2V models and enriched with expert validation, enables fine-grained analysis of reasoning abilities and error types. The work highlights the potential and limits of MLLM-based feedback in AIGC video pipelines and suggests practical paths for improving quality via multi-tool integration and better prompting strategies.

Abstract

MLLMs have recently been widely studied for video question answering. However, most existing assessments focus on natural videos and overlook synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs in interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

Paper Structure

This paper contains 41 sections, 5 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Overview of our research. (a) Collection of AIGC videos: we compile a diverse set of video generation prompts and use them to instruct both proprietary and open-source T2V models to generate AIGC videos. (b) Illustration of errors occurring within the same AIGC video. (c) Dataset analytics: VF-Eval covers a diverse range of reasoning tasks and contains AIGC videos lasting between 4 and 12 seconds, reflecting the typical output length of current T2V models.
  • Figure 2: Illustration of the four proposed tasks and the corresponding question types in the VF-Eval benchmark. Detailed examples for each reasoning task are provided in the appendix.
  • Figure 3: Performance Comparison of InternVL3-38B.
  • Figure 4: Performance comparison of four models on six reasoning sub-tasks.
  • Figure 5: UI of annotation.
  • ...and 12 more figures