VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, Wenhu Chen

TL;DR

Existing long video understanding (LVU) benchmarks often rely on multiple-choice questions (MCQs) and contain questions with strong priors, both of which inflate model performance and obscure true temporal comprehension. VideoEval-Pro introduces a realistic open-ended QA benchmark built from long videos (average ~38 minutes) with a rigorous data curation pipeline, totaling 1,289 questions across 465 videos, and uses it to evaluate 21 proprietary and open-source LMMs. Performance drops substantially (>25%) on open-ended questions compared to MCQs, while scaling the number of input frames significantly improves results, indicating that richer temporal context is essential for LVU. The study finds that proprietary models still lead on VideoEval-Pro while open-source models lag, highlighting the need for robust, open-ended evaluation to accurately track progress in long-video understanding and guide future model development.

Abstract

Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sobering lesson about existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated by the possibility of guessing the correct answer. Second, a significant portion of questions in these benchmarks carry strong priors that allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy on Video-MME given a single random frame from a long video. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark of open-ended short-answer questions that truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we draw the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
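
To make the guessing argument concrete, the following is a minimal sketch of the standard chance-correction arithmetic, not the paper's evaluation procedure. It assumes a simplified model in which a system either knows an answer or guesses uniformly among the options; the function names and the example numbers (40% known answers, 4 options) are hypothetical and chosen only for illustration.

```python
def observed_mcq_accuracy(known_fraction: float, num_options: int) -> float:
    """Expected MCQ accuracy when every unknown question is guessed uniformly.

    Simplifying assumption (not from the paper): the model answers a
    `known_fraction` of questions correctly and picks at random on the rest.
    """
    if not 0.0 <= known_fraction <= 1.0:
        raise ValueError("known_fraction must lie in [0, 1]")
    if num_options < 2:
        raise ValueError("an MCQ needs at least two options")
    return known_fraction + (1.0 - known_fraction) / num_options


def chance_corrected_accuracy(observed: float, num_options: int) -> float:
    """Invert the formula above to estimate the fraction actually known."""
    chance = 1.0 / num_options
    return max(0.0, (observed - chance) / (1.0 - chance))


# Hypothetical numbers: a model that truly knows 40% of the answers
# on a 4-option benchmark.
inflated = observed_mcq_accuracy(0.40, 4)
recovered = chance_corrected_accuracy(inflated, 4)
print(f"observed MCQ accuracy: {inflated:.2f}")
print(f"chance-corrected estimate: {recovered:.2f}")
```

Under these assumptions, guessing alone adds a noticeable margin on top of what the model actually knows. An open-ended short-answer format removes that margin because there is no option list to guess from.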

Paper Structure

This paper contains 32 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison between VideoEval-Pro and MCQ benchmarks. Left: MCQ benchmarks yield inflated scores on identical questions (MCQ vs. Open) and can misrepresent model performance (LVBench; Wang et al., 2024). Right: VideoEval-Pro cannot be effectively solved with a single input frame, and performance scales consistently with more frames. Video-MME (Fu et al., 2024) exhibits contradictory trends.
  • Figure 2: Summary of VideoEval-Pro data composition and task type distribution.
  • Figure 3: Comparison between VideoEval-Pro and Video-MME accuracy across five LMMs.
  • Figure 4: Qualitative comparisons between VideoEval-Pro and the corresponding MCQ problems.