VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning

Shaoyang Cui, Lingbei Meng

Abstract

Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction requires a firm grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat counting as a shallow regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark of 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, covering object, action, and event quantification. VidNum-1.4K is structured as a three-level hierarchy that progresses from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while Gemini-3.1-pro barely reaches the 60% accuracy threshold, representative open-source families remain stuck in the 25%--45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.
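To make the benchmark's organization concrete, the sketch below shows one plausible way to represent a VidNum-1.4K item in Python. The field names (item_id, level, category, and so on) and the exact-match scoring rule are illustrative assumptions, not the benchmark's published release format:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class VidNumItem:
    # Hypothetical schema for a single video-question pair; field names
    # are assumptions for illustration, not the official data format.
    item_id: str      # unique question identifier (assumed field)
    video_id: str     # source video identifier
    question: str     # numerical question posed about the video
    answer: int       # human-annotated ground-truth number
    level: Literal[1, 2, 3]   # 1 = direct perception; 3 = compositional reasoning
    category: Literal["object", "action", "event"]  # counting target

def exact_match_accuracy(preds: dict[str, int], items: list[VidNumItem]) -> float:
    """Fraction of items whose predicted number matches the annotation exactly."""
    correct = sum(preds.get(item.item_id) == item.answer for item in items)
    return correct / len(items)
```

Note that some Level 2/3 questions may involve comparisons or deductions rather than a bare count, so a real schema would likely need a richer answer type than a single integer.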

Paper Structure

This paper contains 15 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Statistics of the VidNum-1.4K benchmark. Left: distribution of video topics. Top-right: distribution of video durations. Bottom-right: distribution across question levels and categories.
  • Figure 2: Top: the data collection and annotation pipeline for VidNum-1.4K. Bottom: the evaluation pipeline on VidNum-1.4K.
  • Figure 3: Impact of zero-shot CoT prompting on selected VLMs across diverse task dimensions. (a) Mean accuracy gain/loss (in percentage points) across the three hierarchical levels of VidNum-1.4K. (b) Mean accuracy gain/loss categorized by counting targets (object, action, and event). The results indicate that while CoT facilitates high-level reasoning in Level 3 and Event tasks, it tends to hinder performance in lower-level perceptual counting (a sketch of the underlying delta computation appears after this list).
  • Figure 4: Scaling trends of the InternVL3 series across VidNum-1.4K hierarchical levels. The left and right panels illustrate performance under NoCoT and CoT settings, respectively.
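The per-level and per-category comparisons in Figure 3 reduce to a simple bookkeeping step: group items, compute accuracy under each prompting mode, and subtract. Below is a minimal sketch of that computation, assuming per-item records that carry a grouping key plus boolean correctness flags for the CoT and NoCoT runs; the record layout is an assumption for illustration, not the paper's actual evaluation code:

```python
from collections import defaultdict

def cot_deltas(records, key):
    """Mean accuracy gain/loss (percentage points) of CoT vs. NoCoT prompting.

    `records` is an iterable of dicts with keys `key` (e.g. "level" or
    "category"), "correct_cot", and "correct_nocot" (booleans) -- an
    assumed record layout for illustration.
    """
    # group -> [cot correct count, nocot correct count, total items]
    groups = defaultdict(lambda: [0, 0, 0])
    for r in records:
        g = groups[r[key]]
        g[0] += r["correct_cot"]
        g[1] += r["correct_nocot"]
        g[2] += 1
    # Positive values mean CoT helped that group; negative means it hurt.
    return {k: 100.0 * (cot - nocot) / total
            for k, (cot, nocot, total) in groups.items()}

# Usage: cot_deltas(records, key="level") or cot_deltas(records, key="category")
```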