VidText: Towards Comprehensive Evaluation for Video Text Understanding
Zhoufaran Yang, Yan Shu, Jing Wang, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, Nicu Sebe
TL;DR
VidText introduces a broad, multilingual benchmark for video text understanding with a hierarchical evaluation framework spanning video-, clip-, and instance-level tasks, and pairs perception and reasoning tasks to assess how models integrate text with dynamic visual context. The dataset comprises 939 long-form videos across 27 categories with multilingual content, accompanied by multi-granularity perception annotations and video-text-centric Chain-of-Thought (CoT) reasoning annotations that support eight tasks. An extensive evaluation of 18 large multimodal models reveals that current systems struggle across most tasks, with proprietary models generally outperforming open-source baselines; OCR, grounding, and multi-task reasoning remain bottlenecks. Ablation studies show that higher input resolution, stronger OCR, auxiliary information, and CoT prompting improve performance, while segment-based evaluation benefits clip- and instance-level tasks more than holistic reasoning. Overall, VidText provides a foundation for advancing OCR, video understanding, and multimodal reasoning in dynamic, text-rich environments.
Abstract
Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are restricted to static images, which limits their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) It introduces a set of paired perception-reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.
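To make the hierarchical, paired design more concrete, the following is a minimal sketch of how per-task results might be grouped by granularity level (video, clip, instance) and by task kind (perception, reasoning). It is not the benchmark's evaluation code: the `TaskResult` record, the `summarize` helper, the task labels, and the assumption that every task yields a single score in [0, 1] are all hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskResult:
    name: str     # hypothetical task label, not taken from the paper
    level: str    # "video", "clip", or "instance"
    kind: str     # "perception" or "reasoning"
    score: float  # assumed to be a single metric in [0, 1]


def summarize(results):
    """Average scores per granularity level and per task kind."""
    by_level = defaultdict(list)
    by_kind = defaultdict(list)
    for r in results:
        by_level[r.level].append(r.score)
        by_kind[r.kind].append(r.score)
    level_avg = {level: mean(scores) for level, scores in by_level.items()}
    kind_avg = {kind: mean(scores) for kind, scores in by_kind.items()}
    return level_avg, kind_avg


if __name__ == "__main__":
    # Made-up placeholder scores, used only to show the call pattern.
    demo = [
        TaskResult("task_a", "video", "perception", 0.5),
        TaskResult("task_b", "video", "reasoning", 0.25),
        TaskResult("task_c", "clip", "perception", 1.0),
    ]
    level_avg, kind_avg = summarize(demo)
    print(level_avg)
    print(kind_avg)
```

Reporting both views side by side is one way to see whether a model's weakness lies at a particular granularity or in the step from perceiving text to reasoning with it.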
