ALLVB: All-in-One Long Video Understanding Benchmark

Xichen Tan, Yuanjing Luo, Yunfan Ye, Fang Liu, Zhiping Cai

TL;DR

ALLVB addresses the need for a rigorous, scalable long-video understanding benchmark by integrating 9 established video tasks into a unified video QA framework. A fully automated GPT-4o-based annotation pipeline, with human quality control, produces 252k Q&As over 1,376 videos that average nearly 2 hours each and span 16 genres, enabling objective, scalable evaluation of multi-modal LLMs (MLLMs). The study shows that current MLLMs, including state-of-the-art commercial models, still have substantial gaps in long-context video reasoning, with Object Detection/Tracking and Needle-in-a-Haystack among the most challenging tasks. Alongside the benchmark results, ALLVB analyzes how video length and frame count affect performance, giving a reference point for future progress in long video understanding.

Abstract

From image to video understanding, the capabilities of Multi-modal LLMs (MLLMs) are increasingly powerful. However, most existing video understanding benchmarks use relatively short videos, which makes them inadequate for evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive and integrated long video understanding benchmark that assesses MLLMs thoroughly. To this end, we propose ALLVB (ALL-in-One Long Video Understanding Benchmark). ALLVB's main contributions include: 1) It integrates 9 major video understanding tasks. These tasks are converted into video QA formats, allowing a single benchmark to evaluate 9 different video understanding capabilities of MLLMs, highlighting the versatility, comprehensiveness, and challenging nature of ALLVB. 2) It uses a fully automated annotation pipeline built on GPT-4o that requires only human quality control, which facilitates the maintenance and expansion of the benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2 hours each, with a total of 252k QAs. To the best of our knowledge, it is the largest long video understanding benchmark in terms of the number of videos, average duration, and number of QAs. We have tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's challenging nature and demonstrates the substantial potential for development in long video understanding.

Paper Structure

This paper contains 112 sections, 19 figures, 4 tables.

Figures (19)

  • Figure 1: The construction pipeline of ALLVB. Utilizing the powerful processing capabilities of GPT-4o, we first segment the movie into different sub-plots based on the corresponding script content. We then create Q&As for the entire video, each sub-plot, and evenly divided needle segments using 91 question templates. Note that needle segments do not correspond to the sub-plot segments.
  • Figure 2: ALLVB Benchmark Statistics Chart. The distributions shown, from left to right, are video duration, number of sub-plots per video, number of Q&As per video, and video genres. Most videos are between 90-150 minutes in length, which is significantly longer than those in other benchmarks, highlighting the challenge of ALLVB. The majority of videos are divided into 5-20 sub-plots, resulting in most videos having 100-250 Q&As, showcasing the benchmark's comprehensiveness. Finally, our videos span 16 diverse genres, ensuring the benchmark's general applicability.
  • Figure 3: Examples of the 11 sub-tasks for the Video Classification (VC) task.
  • Figure 4: Examples of the 12 sub-tasks for the Scene Recognition (SR) task.
  • Figure 5: Examples of the 10 sub-tasks for the Object Detection and Tracking (ODT) task.
  • ...and 14 more figures
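
The Figure 1 caption mentions "evenly divided needle segments" that do not follow the sub-plot boundaries. As a rough illustration of what an even split could look like, here is a minimal Python sketch. The `Segment` type, the `even_segments` function, and the segment count are all hypothetical; the source does not state how many needle segments ALLVB uses or how its pipeline computes the boundaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """A contiguous time span of a video, in seconds (hypothetical type)."""
    index: int
    start_s: float
    end_s: float


def even_segments(duration_s: float, num_segments: int) -> List[Segment]:
    """Split [0, duration_s] into num_segments equal-width, contiguous spans.

    Assumption: "evenly divided" means equal duration per segment. The
    number of segments is a free parameter here, not a value from the paper.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")

    width = duration_s / num_segments
    segments = []
    for i in range(num_segments):
        start = i * width
        # Pin the final boundary to the exact duration to avoid float drift.
        end = duration_s if i == num_segments - 1 else (i + 1) * width
        segments.append(Segment(index=i, start_s=start, end_s=end))
    return segments


if __name__ == "__main__":
    # Example: a 2-hour video split into 8 segments of 15 minutes each.
    for seg in even_segments(7200.0, 8):
        print(seg.index, seg.start_s, seg.end_s)
```

Because the boundaries depend only on the video duration, segments produced this way are independent of the script-based sub-plot segmentation, which is consistent with the caption's note that the two do not correspond.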