IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin

TL;DR

This work introduces IV-Bench, the first benchmark explicitly designed to assess image-grounded video perception and reasoning in multimodal LLMs. It provides 967 videos with 2,585 externally sourced image-text queries across 13 tasks in five categories, accompanied by rigorous two-round quality control. Extensive evaluations of 27 open-source and 4 closed-source models reveal that current systems struggle to leverage image context for video understanding: overall accuracy reaches at most 28.9%, and temporal reasoning is particularly challenging. Ablation studies show that the number of frames and the placement of image context influence performance, while a simple synthetic data approach yields only modest gains, suggesting that deeper methodological advances are needed. The benchmark and accompanying data may catalyze progress in image-grounded video perception and reasoning for real-world multimodal systems.

Abstract

Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
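The accuracy figures quoted above are aggregated over queries that belong to different tasks. As a minimal sketch of how such per-task and overall scores could be computed, the snippet below assumes a hypothetical record schema (`task`, `prediction`, `answer`) and exact-match grading. Neither is taken from the IV-Bench release, and the overall score here is a plain average over all queries, which may differ from the aggregation used in the paper.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return (per_task_accuracy, overall_accuracy) for graded queries.

    Each record is a dict with hypothetical keys "task", "prediction",
    and "answer"; a query counts as correct on exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["task"]] += 1

    per_task = {task: correct[task] / total[task] for task in total}
    n_queries = sum(total.values())
    # Overall accuracy averaged over queries (not over tasks).
    overall = sum(correct.values()) / n_queries if n_queries else 0.0
    return per_task, overall


# Toy records for illustration only; these are not benchmark data.
records = [
    {"task": "Temporal Reasoning", "prediction": "A", "answer": "B"},
    {"task": "Temporal Reasoning", "prediction": "C", "answer": "C"},
    {"task": "Constrained OCR", "prediction": "D", "answer": "D"},
    {"task": "Constrained OCR", "prediction": "A", "answer": "A"},
]
per_task, overall = accuracy_by_task(records)
print(per_task, overall)
```

Reporting the per-task breakdown alongside the overall number makes it easier to see which tasks, such as temporal reasoning, pull the aggregate down.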

Paper Structure

This paper contains 31 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) Video Categories. IV-Bench includes videos spanning five representative categories, ensuring diverse topical coverage. (b) Task distribution in IV-Bench. IV-Bench consists of a total of 13 tasks, which are categorized into two main types: 6 reasoning tasks and 7 perception tasks. (c) Model Performance on IV-Bench. All evaluated MLLMs exhibit limited performance on IV-Bench. Even on the best-performing task (Natural Language Inference), the highest achieved accuracy is merely 64.7%, with other tasks resulting in substantially lower scores.
  • Figure 2: Representative examples from IV-Bench. Each sample consists of a video paired with an image-text query, comprising a query image and corresponding query text. The correct answer is marked in green, with relevant video frames also highlighted in green.
  • Figure 3: Comparison of model performance: (a) across different inference patterns and (b) with varying numbers of frames. MCPMv/o stands for MiniCPMv/o, and IVL for InternVL.
  • Figure 4: Comparison of model performance: (a) across different video resolutions and (b) across various frame-resolution combinations.
  • Figure 5: Five remaining IV-Bench task categories: Natural Language Inference, Constrained OCR, Spatial Relationship, Reasoning, and Temporal Reasoning. Each example requires using text, image, and video together.
  • ...and 2 more figures
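
Figures 3(b) and 4(b) compare models as the number of input frames changes. One common way to pick a fixed number of frames from a video is uniform sampling. The sketch below illustrates that general idea only; the function name and the midpoint rule are assumptions, not the sampling procedure used in the IV-Bench experiments.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick `num_frames` indices spread evenly over `total_frames` frames.

    The video is split into `num_frames` equal segments and the frame at
    the midpoint of each segment is chosen (an illustrative assumption).
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Fewer frames available than requested: use every frame.
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int((i + 0.5) * segment) for i in range(num_frames)]


# Example: choose 4 frames from a 100-frame clip.
indices = uniform_frame_indices(100, 4)
print(indices)  # [12, 37, 62, 87]
```

Raising `num_frames` gives the model denser temporal coverage at the cost of more visual tokens, which is the trade-off a frame-number ablation explores.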