VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Shahbaz Khan

TL;DR

VideoMathQA introduces a multimodal, temporally extended benchmark for mathematical reasoning in real-world instructional videos. The dataset comprises 420 video–question pairs across 10 domains with 2,945 expert-annotated reasoning steps, capturing direct problem solving, concept transfer, and deep instructional comprehension. It provides high-resolution video, aligned subtitles, and audio, plus four rigorous evaluation strategies (MCQ, MBin, CoT, and step-wise evaluation) to diagnose intermediate reasoning and final answers. Findings show that success hinges on long-range cross-modal grounding, with model size, architecture, and reasoning capabilities influencing performance, and subtitles and frame-rich inputs providing measurable gains. This work delivers a systematic framework and dataset to push toward genuine temporal, multimodal mathematical reasoning.

Abstract

Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA

Paper Structure

This paper contains 15 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The foundation of our benchmark is the “needle-in-a-multimodal-haystack” challenge, capturing the core difficulty of cross-modal reasoning across time from visual, textual, and audio streams. Built on this, VideoMathQA categorizes each question along four key dimensions: reasoning type, mathematical concept, video duration, and difficulty.
  • Figure 2: Example questions from the VideoMathQA benchmark illustrating the three reasoning types: Problem Focused, Concept Transfer, and Deep Comprehension. The benchmark includes evolving dynamics in a video, complex text prompts, five multiple-choice options, the expert-annotated step-by-step reasoning to solve the given problem, and the final correct answer as shown above.
  • Figure 3: The figure illustrates a) Distribution of questions and model performance across ten mathematical concepts in VideoMathQA. The consistently low performance across all concepts reveals a significant gap in the ability of current multimodal models to perform mathematical reasoning over videos. b) Distribution of video durations in VideoMathQA, ranging from short clips of 10 s to long videos of up to 1 hr. c) The three-stage annotation pipeline for VideoMathQA, carried out by expert science graduates who annotated detailed step-by-step reasoning trails, with each stage governed by strict quality assessment.
  • Figure 4: The figure shows VideoMathQA performance a) Across video duration categories using the CoT MBin +Sub setting; b) Impact of subtitles under the CoT MBin setting; and c) Effect of varying the number of input frames under CoT MCQ setting. Overall, models perform best on medium-length videos, and overall accuracy improves with the inclusion of subtitles and more frames during evaluation.
  • Figure 5: The figure shows a) Comparison among vision-blind, image-only, and video models, highlighting the need for video-level understanding to perform well in VideoMathQA. b) Distribution of questions in VideoMathQA across three difficulty levels for varying reasoning depths, and the relationship between performance and question difficulty across top-performing models. c) Error analysis based on CoT step evaluation. Most model errors stem from misunderstanding the question, where models misinterpret what the question asks or overlook critical multimodal cues.
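
The figure captions describe each question as tagged along four dimensions (reasoning type, mathematical concept, video duration, difficulty), with accuracy reported per category. The following is a minimal sketch of how such a per-category breakdown could be computed. It is not the benchmark's released evaluation code: the `Item` record, its field names, the `duration_bucket` helper, and the duration thresholds are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark record; all field names are assumptions."""
    reasoning_type: str  # e.g. "problem_focused", "concept_transfer", "deep_comprehension"
    concept: str         # one of the ten mathematical concepts
    duration_s: float    # video length in seconds
    difficulty: str      # one of three difficulty levels (labels assumed)
    answer: str          # gold multiple-choice option label
    prediction: str      # option label chosen by the model


def duration_bucket(seconds, short_max=30.0, medium_max=120.0):
    """Map a video length to a coarse category.

    The thresholds are placeholders, not the benchmark's actual cut-offs.
    """
    if seconds <= short_max:
        return "short"
    if seconds <= medium_max:
        return "medium"
    return "long"


def accuracy_by(items, key):
    """Return {group: fraction of correct predictions} for groups given by key(item)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item in items:
        group = key(item)
        totals[group] += 1
        correct[group] += int(item.prediction == item.answer)
    return {group: correct[group] / totals[group] for group in totals}


# Toy data, invented purely to exercise the functions above.
items = [
    Item("problem_focused", "geometry", 20.0, "easy", "A", "A"),
    Item("problem_focused", "geometry", 45.0, "medium", "B", "C"),
    Item("concept_transfer", "calculus", 600.0, "hard", "D", "D"),
]

by_reasoning = accuracy_by(items, lambda it: it.reasoning_type)
by_duration = accuracy_by(items, lambda it: duration_bucket(it.duration_s))
```

Passing the grouping as a function keeps one aggregation routine usable for all four dimensions, including derived ones such as the duration category.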