V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Yiming Zhao; Yu Zeng; Yukun Qi; YaoYang Liu; Lin Chen; Zehui Chen; Xikun Bao; Jie Zhao; Feng Zhao

TL;DR

The paper introduces V2P-Bench, a visual-prompt-driven benchmark for evaluating video understanding in multimodal human-model interaction, addressing the limitations of text-only prompts in LVLM assessment. It defines 5 tasks across 12 dimensions and assembles 980 videos with 1,172 multiple-choice QA pairs drawn from 12 datasets, spanning short to long durations and diverse video types. Experiments across 16 LVLMs (4 closed-source, 12 open-source) and human experts reveal a sizable gap between state-of-the-art models and human performance, underscoring current shortcomings in understanding video visual prompts. The work provides a foundation for advancing multimodal interaction and motivates more robust, instance-level video understanding benchmarks.

Abstract

Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.
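The headline numbers above are accuracies over multiple-choice QA pairs. As a rough illustration of how such scores could be tallied, the sketch below computes per-dimension and overall accuracy from a list of answer records. The record keys (`dimension`, `answer`, `prediction`) and the function name are hypothetical, chosen for this example; they are not taken from the V2P-Bench release, and the benchmark's actual scoring script may differ.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return (per-dimension accuracy %, overall accuracy %).

    `records` is an iterable of dicts with the hypothetical keys
    'dimension', 'answer' (ground-truth option letter) and
    'prediction' (the model's chosen option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["dimension"]] += 1
    per_dim = {d: 100.0 * correct[d] / total[d] for d in total}
    n = sum(total.values())
    overall = 100.0 * sum(correct.values()) / n if n else 0.0
    return per_dim, overall


# Toy data for illustration only; not real benchmark results.
toy_records = [
    {"dimension": "A", "answer": "B", "prediction": "b"},
    {"dimension": "A", "answer": "C", "prediction": "A"},
    {"dimension": "B", "answer": "D", "prediction": "D "},
    {"dimension": "B", "answer": "A", "prediction": "A"},
]
per_dim, overall = accuracy_by_dimension(toy_records)
print(per_dim, overall)
```

Note that the overall score here is pooled over all QA pairs rather than averaged across dimensions; whether the paper uses one convention or the other is not stated in the abstract.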
