VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Tingyu Song; Tongyan Hu; Guo Gan; Yilun Zhao

TL;DR

VF-Eval introduces a four-task benchmark to evaluate how multimodal LLMs provide feedback on AI-generated videos, covering coherence validation, error awareness, error type detection, and reasoning. Testing 13 frontier models, including GPT-4.1, the study shows that current MLLMs struggle with AIGC-specific content, and its RePrompt experiment suggests that aligning MLLM feedback more closely with human feedback can improve video generation. The dataset, built from a mix of proprietary and open-source V2V sources and enriched with expert validation, enables fine-grained analysis of reasoning abilities and error types. The work highlights the potential and limits of LLM-based feedback in AIGC video pipelines and suggests practical paths for improving quality via multi-tool integration and better prompting strategies.

Abstract

Multimodal large language models (MLLMs) have recently been widely studied for video question answering. However, most existing assessments focus on natural videos and overlook synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs in interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks, which highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
