CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, Yidan Zhang, Jiang Zhong, Peijin Wang, Yingchao Feng

TL;DR

CFVBench introduces a large-scale, manually verified benchmark for fine-grained video-based MRAG, spanning 599 videos and 5,360 open-ended QA pairs across structured data, tutorials, and news. A key finding is that existing MRAG methods struggle with transient, fine-grained multimodal cues, motivating the Adaptive Visual Refinement (AVR) framework that adaptively increases frame sampling and invokes on-demand tools. AVR consistently improves fine-grained comprehension and performance across 14 MLLMs, highlighting the value of dynamic, multimodal evidence augmentation. The dataset and prompts are poised to advance evaluation and development of video-based multimodal reasoning in real-world settings.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks or on coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while preserving fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs.
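The idea of adaptively increasing frame sampling density can be illustrated with a minimal sketch. This is not the AVR implementation: the function names, the default frame counts, the doubling factor, and the `needs_detail` callback are all hypothetical, chosen only to show how a sampler might densify when a signal indicates that fine-grained detail is being missed. Tool invocation is not modeled here.

```python
from typing import Callable, List


def sample_timestamps(duration_s: float, num_frames: int) -> List[float]:
    """Return num_frames timestamps at the midpoints of equal-width segments."""
    if num_frames <= 0 or duration_s <= 0:
        return []
    step = duration_s / num_frames
    return [round((i + 0.5) * step, 3) for i in range(num_frames)]


def refine_sampling(
    duration_s: float,
    needs_detail: Callable[[int], bool],
    base_frames: int = 8,    # assumed starting density, not from the paper
    factor: int = 2,         # assumed growth factor, not from the paper
    max_frames: int = 64,    # assumed cap, not from the paper
) -> List[float]:
    """Grow the frame count while needs_detail(count) is True, up to max_frames.

    needs_detail is a hypothetical stand-in for whatever signal decides
    that the current sampling density is too coarse.
    """
    num_frames = base_frames
    while num_frames < max_frames and needs_detail(num_frames):
        num_frames = min(num_frames * factor, max_frames)
    return sample_timestamps(duration_s, num_frames)


if __name__ == "__main__":
    # A coarse pass is enough: the base density is kept.
    coarse = refine_sampling(10.0, lambda n: False)
    # Detail is requested until at least 32 frames are sampled.
    dense = refine_sampling(16.0, lambda n: n < 32)
    print(len(coarse), len(dense))
```

Capping the frame count keeps the cost bounded even when the detail signal never clears, which is one plausible design choice for an adaptive sampler of this kind.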

Paper Structure

This paper contains 23 sections, 2 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of video-based MRAG benchmarks with CFVBench, where clues are embedded in tables/on-screen text of video frames, requiring fine-grained reasoning.
  • Figure 2: The dataset construction process of CFVBench.
  • Figure 3: The distribution of videos in CFVBench.
  • Figure 4: Human evaluation of typical MLLMs on CFVBench.
  • Figure 5: The workflow of the AVR framework.
  • ...and 5 more figures