VidText: Towards Comprehensive Evaluation for Video Text Understanding

Zhoufaran Yang, Yan Shu, Jing Wang, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, Nicu Sebe

TL;DR

VidText introduces a broad, multilingual benchmark for video text understanding with a hierarchical evaluation framework spanning video-, clip-, and instance-level tasks, and pairs perceptual and reasoning challenges to assess multimodal integration of text in dynamic contexts. The dataset comprises 939 long-form videos across 27 categories with multilingual content, accompanied by multi-granularity perception annotations and video-text-centric Chain-of-Thought (CoT) reasoning annotations to support eight tasks. An extensive evaluation of 18 large multimodal models reveals that current systems struggle across most tasks, with proprietary models generally outperforming open-source baselines; OCR and grounding capabilities, as well as multi-task reasoning, remain bottlenecks. Ablation studies show that higher input resolution, stronger OCR, auxiliary information, and CoT prompting improve performance, while segment-based evaluation benefits clip/instance tasks more than holistic reasoning. Overall, VidText provides a foundation for advancing OCR, video understanding, and multimodal reasoning in dynamic, text-rich environments.

Abstract

Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) It introduces a set of paired perception-reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.

Paper Structure

This paper contains 32 sections, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: Statistical overview of VidText. (Left) Video genres included in VidText. (Top right) Visual text instance distribution. (Bottom right) Hierarchical task-type settings.
  • Figure 2: Examples from VidText. The benchmark includes eight tasks, featuring paired perception and reasoning components designed to evaluate the video-level, clip-level, and instance-level capabilities of LMMs. Given the video input and textual prompt, models are required to solve the tasks, with ground-truth answers highlighted in green.
  • Figure 3: Ablation studies on the multi-granularity design of VidText.
  • Figure 4: Ablation studies on the joint reasoning of video texts and video contents. "HR", "LR" and "SR" denote Holistic Reasoning, Local Reasoning and Spatial Reasoning, respectively. We visualize "Video content masking" and "Video Text masking" in the right part.
  • Figure 5: Text quantity distribution across six scene categories.
  • ...and 12 more figures